Tasks

=Task 1: Determining the RISC ISA=

**Assembler**
A two-pass assembler has been written in Java to process our bubble-sort assembly program into executable binaries. In the first pass, we resolve all symbols into a symbol table. In the second pass, we generate two MIF files: a data-memory file containing the array size and contents, and an instruction-memory file to be executed by the processor. This version of the assembler was written to complete the first task, and will most likely be modified for later tasks.
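The two-pass scheme can be sketched as follows. This is a Python illustration (the actual assembler is written in Java), and the opcode values and line format here are assumptions for the sketch, not our real encoding:

```python
# Minimal two-pass assembler sketch. Pass 1 records label addresses in a
# symbol table; pass 2 resolves label operands and maps mnemonics to
# (illustrative) opcode numbers.

OPCODES = {"add": 0, "slt": 1, "beq": 2, "lw": 3, "sw": 4, "addi": 5}

def assemble(lines):
    # Pass 1: strip comments/blanks and record the address of every label.
    symbols, instrs = {}, []
    for line in lines:
        line = line.split("#")[0].strip()
        if not line:
            continue
        if line.endswith(":"):
            symbols[line[:-1]] = len(instrs)   # label -> instruction index
        else:
            instrs.append(line)
    # Pass 2: split each instruction and replace labels with their addresses.
    out = []
    for instr in instrs:
        op, *args = instr.replace(",", " ").split()
        args = [str(symbols[a]) if a in symbols else a for a in args]
        out.append((OPCODES[op], args))
    return symbols, out

syms, code = assemble(["loop:", "add $1 $2 $3", "beq $1 $0 loop"])
```

A real second pass would also emit the MIF text format rather than tuples, and would typically encode branch targets as relative offsets.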

**Bubble Sort Algorithm**
A basic bubble sort algorithm has been written in assembly using only the ISA detailed below. To better understand the structure of the algorithm at the assembly level, a version was first written in C and compiled down to assembly. Analyzing that (fairly complicated) compiler output guided our implementation in the limited MIPS-style ISA, which proved much easier to develop. Optimization of the assembly code has not been pursued at this time.
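The control structure of the algorithm, restricted to what the ISA supports (compare, branch, add, load, store), can be sketched behaviorally. Python stands in for the assembly here, and the register names in the comments are illustrative, not our actual register allocation:

```python
# Behavioral sketch of the assembly bubble sort. Each operation maps to an
# instruction the ISA provides: slt for the comparison, beq for the loop
# exits, addi for the counters, and lw/sw for the array accesses.

def bubble_sort(mem):
    n = len(mem)                      # array size, loaded from data memory
    i = 0
    while i < n - 1:                  # outer loop: slt + beq against n-1
        j = 0
        while j < n - 1 - i:          # inner loop over the unsorted prefix
            a = mem[j]                # lw $t0, 0($s0)
            b = mem[j + 1]            # lw $t1, 4($s0)
            if b < a:                 # slt + beq: swap when out of order
                mem[j], mem[j + 1] = b, a   # two sw instructions
            j += 1                    # addi $t2, $t2, 1
        i += 1                        # addi $t3, $t3, 1
    return mem
```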

**ISA and Processor Architecture Changes**
The ISA as of the end of Task 1 is shown below. Additions to this ISA are still possible as the other tasks are completed. As for processor architecture, the assembler and bubble sort were written for the single-stage processor distributed to the class (with slight modifications). This architecture will necessarily change over the coming tasks, which require a more formal implementation.


 * ~ R-Type ||~ Operands ||~ Semantics ||
 * **add** || $d,$s,$t || $d=$s+$t ||
 * **and** || $d,$s,$t || $d=$s&$t ||
 * **or** || $d,$s,$t || $d=$s|$t ||
 * **slt** || $d,$s,$t || $d=($s<$t) ||
 * **sub** || $d,$s,$t || $d=$s-$t ||
 * ~ I-Type ||~ Operands ||~ Semantics ||
 * **addi** || $s,$t,C || $s=$t+C ||
 * **beq** || $s,$t,C || if ($s==$t) PC=PC+C ||
 * **lw** || $t,C($s) || $t=Mem[$s+C] ||
 * **sw** || $t,C($s) || Mem[C+$s]=$t ||
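The semantics column of the table can be captured by a small behavioral interpreter. This Python sketch assumes word-addressed memory and number-indexed registers for brevity; the operand order in each tuple mirrors the table:

```python
# One-instruction step function for the ISA above. Instructions are tuples:
# ("add", d, s, t), ("addi", s, t, C), ("beq", s, t, C),
# ("lw", t, C, s), ("sw", t, C, s). Returns the next PC.

def step(instr, regs, mem, pc):
    op = instr[0]
    if op == "add":   _, d, s, t = instr; regs[d] = regs[s] + regs[t]
    elif op == "sub": _, d, s, t = instr; regs[d] = regs[s] - regs[t]
    elif op == "and": _, d, s, t = instr; regs[d] = regs[s] & regs[t]
    elif op == "or":  _, d, s, t = instr; regs[d] = regs[s] | regs[t]
    elif op == "slt": _, d, s, t = instr; regs[d] = int(regs[s] < regs[t])
    elif op == "addi": _, d, s, c = instr; regs[d] = regs[s] + c
    elif op == "beq":
        _, s, t, c = instr
        if regs[s] == regs[t]:
            return pc + c             # taken branch: PC = PC + C
    elif op == "lw": _, t, c, s = instr; regs[t] = mem[regs[s] + c]
    elif op == "sw": _, t, c, s = instr; mem[regs[s] + c] = regs[t]
    return pc + 1                     # fall through to the next instruction
```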

=Task 2: Pipelining the Processor=

**Verilog**
All previous processor code (i.e. the example processor in VHDL with our modified elements) has been scrapped in favor of a fresh start in Verilog. Thus the processor that will be pipelined is written in Verilog, not VHDL.

**ISA Reduction**
Our ISA has been reduced from the original set, since several instructions were never used or were unnecessary. We have removed AND, OR, and SUB from our supported instructions. The current ISA can be seen in tabular form on our homepage.

**Pipelining**
Our first step in pipelining the new Verilog processor was adding four pipeline register modules: FetchDecode, DecodeExecute, ExecuteMemory, and MemoryWriteBack. We also created a separate WriteBack module, since our previous implementation had not required that logic to be separate from the Memory module.
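The role of the four registers can be modeled as latching each stage's output into the next stage on every clock edge. A behavioral Python sketch (the real modules are Verilog; the dict payload is an illustrative simplification):

```python
# Model of the four pipeline registers named above. Each call to clock()
# represents one rising edge: a newly fetched instruction enters FetchDecode,
# every other register takes the previous register's value, and the
# instruction leaving MemoryWriteBack retires through WriteBack.

BUBBLE = {"instr": "nop"}

class Pipeline:
    def __init__(self):
        # FetchDecode, DecodeExecute, ExecuteMemory, MemoryWriteBack
        self.fd = self.de = self.em = self.mw = BUBBLE

    def clock(self, fetched):
        retired = self.mw             # instruction completing WriteBack
        # Simultaneous latch: all new values come from the old ones.
        self.fd, self.de, self.em, self.mw = fetched, self.fd, self.de, self.em
        return retired
```

An instruction fetched on cycle 1 therefore retires on cycle 5, matching the five-stage flow through the four registers.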

**Forwarding**
Below is a diagram of the forwarding implementation used in our pipelined processor. Output and instruction values from the execute and memory registers are fed back into the EXECUTE block to drive the forwarding logic.
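The decision made in the EXECUTE block reduces to comparing an operand's source register against the destination registers held in the two downstream pipeline registers. A hedged Python sketch of that logic (the signal names are illustrative, not our Verilog port names):

```python
# Forwarding select for one ALU operand. Priority goes to the ExecuteMemory
# result, since it is the most recent write to the register; register 0 is
# hardwired to zero and is never forwarded.

def forward_select(src_reg, em_dest, mw_dest):
    """Return where operand src_reg should come from this cycle."""
    if em_dest is not None and em_dest == src_reg and src_reg != 0:
        return "EM"        # forward ALU result from the ExecuteMemory register
    if mw_dest is not None and mw_dest == src_reg and src_reg != 0:
        return "MW"        # forward from the MemoryWriteBack register
    return "REG"           # no hazard: read the register file normally
```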

=Task 3: Adding a Static Branch Predictor=

**Stalls & Flushes**
Since a portion of static branch prediction had already been implemented in the previous task (Pipelining the Processor), the only functionality left to implement was stalling and flushing, i.e. handling missed or incorrect branch predictions. We added a Control unit to the processor to house the static branch-prediction logic.
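The flush path can be sketched as follows, assuming a static predict-not-taken policy (consistent with how the performance counters below treat a not-taken branch as a correct prediction). The function names and payloads are illustrative:

```python
# Flush sketch for the Control unit. When a branch resolves taken in EXECUTE,
# the two younger instructions already fetched behind it (sitting in the
# FetchDecode and DecodeExecute registers) are squashed into bubbles and the
# PC is redirected to the branch target; otherwise the pipeline is untouched.

NOP = {"instr": "nop"}

def resolve_branch(taken, target, pc, fd_reg, de_reg):
    if taken:                          # misprediction: flush and redirect
        return target, NOP, NOP
    return pc + 1, fd_reg, de_reg      # prediction correct: keep the pipeline
```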

**Performance Counters**
To monitor the performance of our branch predictor, we assigned the "correct branch" and "total number of branches" statistics to two separate locations in memory. Each time a branch occurs, the "total number of branches" counter is incremented; each time a branch is not taken (i.e. predicted correctly), the "correct branch" counter is incremented. At the end of program execution, we output the two values for statistical analysis.
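The counter updates amount to the following. The memory addresses here are assumptions for the sketch; the real locations are whatever our data-memory map reserves:

```python
# Memory-mapped branch counters. Every resolved branch bumps the total;
# a not-taken branch matches our static not-taken prediction, so it also
# bumps the correct-prediction count.

TOTAL_BRANCHES, CORRECT_BRANCHES = 0x3E, 0x3F   # assumed word addresses

def count_branch(mem, taken):
    mem[TOTAL_BRANCHES] = mem.get(TOTAL_BRANCHES, 0) + 1
    if not taken:
        mem[CORRECT_BRANCHES] = mem.get(CORRECT_BRANCHES, 0) + 1

mem = {}
for taken in (False, True, False):
    count_branch(mem, taken)
# mem now records 3 branches total, 2 predicted correctly
```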

**Visualization Organization**
After pipelining, forwarding, and the static branch predictor were added, our block diagram had become quite messy and hard to use. We spent a few minutes tidying it up, and now it's much easier to read and follow.

=Task 4: Incorporating Instruction/Data Caches=


**Cache Implementation**
From a design standpoint, we chose to implement our caches alongside the memory module of the processor, avoiding the unnecessary confusion of adding yet another component to the design schematic. The logic behind our cache comes from examples in our textbook as well as the example materials provided on the course homepage.

**Cache Organization**
For the organization of our caches, we followed the recommended specifications in the project description: a 16-set cache with 16-byte blocks. So far we have not experienced any performance difficulties with this implementation, but if the scheme performs extremely poorly for some configuration of our code or incoming/cached data, we will re-evaluate it.
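With 16 sets and 16-byte blocks, an address splits into a 4-bit byte offset, a 4-bit set index, and a tag from the remaining high bits. A quick sketch of that decomposition:

```python
# Address breakdown for a 16-set cache with 16-byte blocks: low 4 bits pick
# the byte within the block, the next 4 bits pick the set, and everything
# above forms the tag stored alongside each block.

BLOCK_BITS, INDEX_BITS = 4, 4     # 16-byte blocks, 16 sets

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

# e.g. address 0x1234 -> tag 0x12, set 3, byte offset 4
```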

**Write-Through vs. Write-Back**
We decided to implement a write-through design in order to keep the operation of our cache as simple as possible.
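The simplification is that every store updates main memory immediately, so no dirty bits or write-back traffic are needed. A minimal sketch, assuming the 16-set/16-byte layout above and write-no-allocate on a store miss (that miss policy is an assumption of the sketch, not a stated design decision):

```python
# Write-through store: memory is always written; the cached copy is updated
# only if the line is already present (a store hit). Cache lines are modeled
# as {set_index: (tag, data_dict)} for brevity.

def store(cache, memory, addr, value):
    memory[addr] = value                      # write-through: memory first
    idx, tag = (addr >> 4) & 0xF, addr >> 8   # 16 sets of 16-byte blocks
    line = cache.get(idx)
    if line is not None and line[0] == tag:
        line[1][addr] = value                 # keep the cached copy coherent

cache = {3: (0x12, {})}
memory = {}
store(cache, memory, 0x1234, 99)   # hits set 3, tag 0x12: cache updated too
store(cache, memory, 0x2234, 7)    # same set, different tag: memory only
```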