/// LSU EE 3755 -- Fall 2013 -- Computer Organization
/// Note Set 13 -- Our MIPS Implementations

// Time-stamp: <5 December 2013, 10:18:05 CST, koppel@sky.ece.lsu.edu>

/// Contents
// Class MIPS Implementations
// Strawman MIPS Implementation
// Hardwired Control MIPS Implementation
// Microcoded Control MIPS Implementation
// Other Implementations

/// References
// :PH:  Patterson & Hennessy, "Computer Organization & Design"
// :Mv1: MIPS Technologies, "MIPS32 Architecture for Programmers Vol I: Intro"
// :Mv2: MIPS Technologies, "MIPS32 Architecture for Programmers Vol II: Instr"

/// Class MIPS Implementations

 //  For recent copies see: http://www.ece.lsu.edu/ee3755/ln.html

 /// Summary of Implementation Types

 //  All Verilog descriptions are synthesizable.
 //  They execute only a subset of MIPS instructions.

 /// Functional Simulator   (EE 3755)  2013
//   http://www.ece.lsu.edu/ee3755/2013f/mips_fs.v.html
//   Not intended for synthesis, but synthesizable nevertheless.
//   Illustrates basic implementation techniques.
//   Can be used to validate other models.

 /// Very Simple MIPS Implementation (EE 3755) 2013
//   Lectures: http://www.ece.lsu.edu/ee3755/2013f/mips-hw.pdf
//   Verilog:  http://www.ece.lsu.edu/ee3755/2013f/mips_vsi.v.html
//   Low cost, very simple design, unnecessarily low performance.
//   Verilog style closely matches synthesized hardware.
//   The Very Simple MIPS implementation is as simple as possible but
//   is still synthesizable. The ALU is used for as much as possible,
//   including computing branch conditions, branch targets, and
//   non-branch incrementing of the PC (actually NPC). Also to keep
//   things simple, as much work as possible is done in one
//   cycle. Packing a lot of work in one cycle forces the clock
//   frequency to be low, which is why the VS MIPS is not practical.
//   The hardwired control implementation (below) spreads work out
//   over multiple cycles, resulting in higher performance. It also
//   uses a separate adder for incrementing the NPC, reducing the
//   number of cycles needed for some instructions.

 /// Hardwired Control Implementation  (EE 3755) 2012 and Earlier
//   http://www.ece.lsu.edu/ee3755/2013f/mips_hc.v.html
//   Low cost, small size.
//   Processors implemented this way may be used where the very smallest
//   size is needed, for example as part of a complete system on a
//   chip used to control, say, a microwave oven.  In reality,
//   even low-cost MIPS implementations are pipelined, unlike this
//   implementation (see below).

 /// Microcoded Control Implementation  (EE 3755)
//   http://www.ece.lsu.edu/ee3755/2001f/mipsmc.html
//   Low cost.
//   Microcoding is usually applied to processors with more complex
//   instruction sets.

 /// Pipelined, Statically Scheduled  (EE 4720)
//   http://www.ece.lsu.edu/ee4720/v/mipspipeby.html
//   Similar to hardwired control, but much faster at little additional cost.
//   Many current processors are implemented this way, e.g., the UltraSPARC III.

 /// Pipelined, Dynamically Scheduled  (EE 4720)
//   http://www.ece.lsu.edu/ee4720/v/mipspipeds.html
//   Some additional speed, lots of additional complexity.
//   Many current processors are implemented this way, e.g., the Pentium 4
//   and the MIPS R10000.

/// MIPS Functional Simulator

 //   http://www.ece.lsu.edu/ee3755/2013f/mips_fs.v.html

 ///  Simple as possible.

 ///  Uses
//    Illustrate basics of implementation to students.
//    Verify more complex implementations.

 ///  Key Characteristics
 ///   Each instruction executed in a single cycle.
//      Unrealistic because it precludes hardware sharing, for example,
//      using the ALU to compute a branch condition and target.

 ///   Memory is within processor. 
//      Unrealistic because memory is usually too large to fit on the
//      same chip as the CPU.
//      Even if some memory is on-chip an interface to that memory
//      (addr, we, etc.) is usually needed for external communication.
//      Even if an external interface is present, we might not trust
//      the synthesis program to properly infer a memory from our
//      "casual" use of the mem array.

 ///   Almost no thought given to synthesized hardware.
//      Synthesis programs cannot yet be trusted to do a good job.

//  The problems above are only relevant if the description is synthesized.

//  If simulation is our goal then the MIPS functional simulator
//  implementation is fine because of its simplicity.  For simulation
//  a description should be as simple as possible to minimize the
//  chance of errors.

/// Very Simple MIPS Implementation

 //   Lectures: http://www.ece.lsu.edu/ee3755/2013f/mips-hw.pdf
 //   Verilog:  http://www.ece.lsu.edu/ee3755/2013f/mips_vsi.v.html

 ///  Low Cost Implementation

 ///  Key Characteristics
 ///   Memory is outside the CPU.
//    This is the way it should be.

 ///   The ALU is a separate module.
//    In the functional simulator a synthesis program might have
//    synthesized an adder each place an addition operator is used, a
//    bitwise AND for each AND, etc.  This implementation instantiates
//    one ALU, so that's all the synthesis program will create.
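
//    A minimal sketch of the idea, not the code in mips_vsi.v.  The
//    module names, port names, and op encodings are assumptions made
//    for illustration.  Because the datapath instantiates the ALU
//    once, every operation must route its operands through that one
//    instance rather than getting an operator of its own.

```verilog
// Hypothetical ALU: names and encodings are illustrative only.
module alu_sketch(output reg [31:0] result,
                  input [31:0] a, b,
                  input [1:0] op);

   localparam OP_ADD = 2'd0, OP_SUB = 2'd1, OP_AND = 2'd2, OP_OR = 2'd3;

   // Combinational: all four op values are covered, so no latch is inferred.
   always @* begin
      case ( op )
        OP_ADD: result = a + b;
        OP_SUB: result = a - b;
        OP_AND: result = a & b;
        OP_OR:  result = a | b;
      endcase
   end
endmodule

// Hypothetical datapath fragment: the only ALU in the design.
module datapath_sketch(output [31:0] alu_out,
                       input [31:0] alu_a, alu_b,
                       input [1:0] alu_op);
   alu_sketch our_alu(alu_out, alu_a, alu_b, alu_op);
endmodule
```

//    The cost of sharing is the multiplexors needed to steer the
//    right operands to alu_a and alu_b for each use.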

 ///   ALU is Used as Much as Possible

//    The ALU is used for as much as possible, including computing
//    branch conditions, branch targets, and non-branch incrementing
//    of the PC (actually NPC). (Most real implementations use
//    a separate adder for incrementing the PC and computing
//    branch targets.)

 ///   Each instruction is executed in multiple cycles, but as few as possible.

//    By executing in multiple cycles, hardware such as the ALU and
//    memory port can be used multiple times for the same
//    instruction. For example, the ALU can be used to compute a
//    branch condition (test if two registers are equal) and to
//    compute the branch target. The memory port can be used to read
//    the instruction from memory and for load instructions to read
//    the data from memory.
//    To keep things simple, as much work as possible is done in one
//    cycle. Packing a lot of work in one cycle forces the clock
//    frequency to be low, which is why the VS MIPS is not practical.
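
//    The ALU reuse described above might look something like the
//    sketch below.  It is not taken from mips_vsi.v: the state names,
//    the signal simm_x4 (sign-extended immediate times four), and the
//    two-state sequence are assumptions, and incrementing npc for a
//    branch that is not taken is omitted.

```verilog
// Hypothetical two-state beq sequence sharing one adder.
module beq_alu_reuse_sketch
  (output reg [31:0] npc,
   input [31:0] rs_val, rt_val, simm_x4,
   input clk, reset);

   localparam ST_COND = 1'b0, ST_TARGET = 1'b1;
   reg        state, taken;

   // Operand multiplexors select what the single ALU works on in each state.
   wire        in_cond = state == ST_COND;
   wire [31:0] alu_a   = in_cond ? rs_val : npc;
   wire [31:0] alu_b   = in_cond ? rt_val : simm_x4;

   // One adder does both jobs: a - b (computed as a + ~b + 1) or a + b.
   wire [31:0] alu_out = alu_a + ( in_cond ? ~alu_b : alu_b ) + in_cond;

   always @( posedge clk )
     if ( reset ) begin
        state <= ST_COND;
        taken <= 1'b0;
        npc   <= 32'd0;
     end else if ( in_cond ) begin
        taken <= alu_out == 32'd0;     // True when rs_val equals rt_val.
        state <= ST_TARGET;
     end else begin
        if ( taken ) npc <= alu_out;   // npc + simm_x4
        state <= ST_COND;
     end
endmodule
```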

 ///  Further Refinements That Might Be Made

 ///   Use More States (Cycles) for Certain Instructions
//    The load instruction does a lot more work in the ID state than
//    the add instruction. Since the clock frequency is based on the
//    critical path (worst case), the add instruction suffers because
//    of the load. To avoid this, the load can be performed over
//    multiple states, which is what is done in the HC implementation.

/// Hardwired Control MIPS Implementation

 //   http://www.ece.lsu.edu/ee3755/2013f/mips_hc.v.html

 ///  Low Cost Implementation

 ///  Key Characteristics in Comparison to Very Simple Implementation
 ///   Additional States are Used for Load and Store Instructions
//    This enables a higher clock frequency.

 ///  Further Refinements That Might Be Made
 ///   Separate out additional modules.
//    In real systems arithmetic and shifting are done by separate
//    units because shifting logic cannot be shared with arithmetic
//    or logic.
//    Dividing parts of the design into separate modules allows them
//    to be developed concurrently and makes tuning of cost (gates)
//    and performance (cycle time) easier.
//    Humans know the GPR is a memory with two read ports and a write
//    port.  The synthesis program might not realize it (perhaps due
//    to bad Verilog coding style) and instead provide more write ports
//    than necessary or it might not realize it can use the
//    streamlined memory cell provided in the target technology
//    library.  Having a separate GPR module can avoid these problems.
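
//    A separate GPR module might be sketched as follows.  The module
//    and port names are assumptions, not names from mips_hc.v.  The
//    port list itself states the intent: two read ports and exactly
//    one write port.

```verilog
// Hypothetical register file: two read ports, one write port.
module gpr_sketch
  (output [31:0] rs_val, rt_val,
   input [4:0] rs, rt, wr_reg,
   input [31:0] wr_val,
   input wr_en, clk);

   reg [31:0] regs [0:31];

   // Two combinational read ports.  MIPS register 0 always reads as zero.
   assign rs_val = rs == 5'd0 ? 32'd0 : regs[rs];
   assign rt_val = rt == 5'd0 ? 32'd0 : regs[rt];

   // The only write port.
   always @( posedge clk ) if ( wr_en ) regs[wr_reg] <= wr_val;
endmodule
```

//    With the storage isolated like this the module could also be
//    swapped for a technology-library memory without touching the
//    rest of the design.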
 ///   Perform some limited pipelining.
//     The fetch of one instruction can be overlapped with the
//     execution of the previous one.  See Fall 2001 Homework 7
//     Problem 2.

 ///   Refine the ALU design.
//     Will the synthesis program efficiently combine the different
//     operations in the ALU module?  If not, redo the ALU with a
//     lower level design, something like alu2 in l05.v (Section 4.5
//     of the text).

 ///   Move code for simmed, etc. out of the procedural block to simplify hardware.
//     Note: the Very Simple Implementation already does this.

//     The synthesis program might create more registers and other
//     logic than necessary.  For example, the assignment
       rs_val = gpr[rs];
//     in the ID state might force the synthesis program to use a
//     register for rs_val (if it were accessed outside the ID state).
//     Making that a continuous assignment outside of procedural code
//     would solve the problem: 
       wire [31:0] rs_val = gpr[rs]; 
//     Another candidate for de-proceduralization is:
       {opcode,rs,rt,rd,sa,func} = ir; 
//     since ir doesn't change while an instruction is being executed.
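
//     Collected into one place, the de-proceduralized decode might
//     look like the sketch below.  The module name and port list are
//     assumptions; the field layout is the standard MIPS R-format
//     split, and simmed is taken to be the sign-extended 16-bit
//     immediate.

```verilog
// Hypothetical decode module using only continuous assignments.
module ir_fields_sketch
  (output [5:0] opcode, func,
   output [4:0] rs, rt, rd, sa,
   output [31:0] simmed,
   input [31:0] ir);

   // Pure wiring: 6 + 5 + 5 + 5 + 5 + 6 = 32 bits, no registers needed.
   assign {opcode,rs,rt,rd,sa,func} = ir;

   // Sign-extend the low 16 bits of the instruction.
   wire [15:0] immed = ir[15:0];
   assign simmed = { {16{immed[15]}}, immed };
endmodule
```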

 ///   Check timing and move hardware between states.
//     For example, "move" the alu input multiplexors from ID to the
//     EX state.  The ID state would compute an alu control input,
//     like alu_a_src in the microcoded design.
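
//     One way this could look is sketched below.  Only the name
//     alu_a_src comes from the notes; the module, the encodings, and
//     id_src_choice (standing in for the output of the decode logic)
//     are assumptions.  The ID state saves a two-bit select value,
//     and the multiplexor itself operates in the EX state.

```verilog
// Hypothetical fragment: select computed in ID, multiplexor used in EX.
module alu_src_sketch
  (output [31:0] alu_a,
   input [31:0] rs_val, sa_val, npc,
   input [1:0] id_src_choice,
   input in_id, clk);

   localparam SRC_RS = 2'd0, SRC_SA = 2'd1, SRC_NPC = 2'd2;
   reg [1:0] alu_a_src;

   // ID state: remember only the small select value, not a 32-bit operand.
   always @( posedge clk ) if ( in_id ) alu_a_src <= id_src_choice;

   // EX state: the multiplexor sits directly in front of the ALU input.
   // Any select value other than SRC_SA or SRC_NPC chooses rs_val.
   assign alu_a = alu_a_src == SRC_SA  ? sa_val :
                  alu_a_src == SRC_NPC ? npc    : rs_val;
endmodule
```

//     Whether this helps depends on which state holds the critical
//     path, which is why timing should be checked first.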

 ///   Add the remaining MIPS instructions!

/// Microcoded Control MIPS Implementation

//   http://www.ece.lsu.edu/ee3755/2001f/mipsmc.html

 ///  Different Type of Control Logic

//  In the Hardwired Control Implementation the control signals (the
//  control inputs to the multiplexors and the enable inputs to the
//  registers) are output by combinational logic.  (The combinational
//  logic is generated from the case expressions in case statements
//  and the condition in if statements.  To reduce the number of
//  multiplexor inputs, optimization might make this logic quite
//  complex.)

//  In the Microcoded Control Implementation the control signals are
//  generated by a small computer, called the microcontroller.

//  Microcode is used in older mainframe-class systems.
//  Microprocessors might use microcode, or something like it, for
//  only a few instructions.

//  A microcoded MIPS implementation is very unlikely.

 ///  Key Characteristics
 ///    Control signals generated by microcontroller.
 ///    Control logic is very simple, complexity is in the microprogram ROM.
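
//  A toy illustration of these two characteristics, not the design at
//  the URL above.  The micro-instruction format, the ROM contents,
//  and the names upc, urom, and insn_class are all assumptions.  The
//  control logic is little more than a ROM lookup and a next-address
//  choice; what the processor does is determined by the ROM contents.

```verilog
// Hypothetical microcontroller: 8-entry microprogram ROM and a micro-PC.
module ucode_ctl_sketch
  (output [3:0] ctl_signals,
   input [1:0] insn_class,
   input clk, reset);

   // Each micro-instruction: { dispatch, next_upc[2:0], ctl[3:0] }.
   reg [7:0] urom [0:7];
   reg [2:0] upc;

   initial begin
      urom[0] = 8'b0_001_0001;  // Fetch, then go to 1.
      urom[1] = 8'b1_000_0010;  // Decode, then dispatch to 4 + insn_class.
      urom[2] = 8'b0_000_0000;  // Unused.
      urom[3] = 8'b0_000_0000;  // Unused.
      urom[4] = 8'b0_000_0100;  // Routine for class 0, then back to fetch.
      urom[5] = 8'b0_000_1000;  // Routine for class 1, then back to fetch.
      urom[6] = 8'b0_000_1100;  // Routine for class 2, then back to fetch.
      urom[7] = 8'b0_000_0000;  // Class 3 not implemented; back to fetch.
   end

   wire [7:0] uinsn = urom[upc];
   assign ctl_signals = uinsn[3:0];

   // Next micro-PC: dispatch on the instruction or follow the next field.
   always @( posedge clk )
     if ( reset )         upc <= 3'd0;
     else if ( uinsn[7] ) upc <= { 1'b1, insn_class };
     else                 upc <= uinsn[6:4];
endmodule
```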

 ///  Further Refinements That Might Be Made

 ///    Make part of micro ROM writable.
//      This would allow new instructions to be added after
//      the part is manufactured.  (And yes, if it were writable
//      it probably shouldn't be called a ROM anymore.)

 ///    Implement all instructions!

/// Other Implementations

 // http://www.ece.lsu.edu/ee4720

 /// Implementation Techniques Covered in EE 4720

 /// Pipelined, Statically Scheduled
//  http://www.ece.lsu.edu/ee4720/v/mipspipeby.html
//  Overlap execution states, so the ID for one instruction
//  is done at the same time as IF for the next one.
//  "States" for a pipelined processor are carefully chosen to
//  allow overlap.
//  One section of hardware, called a stage, is dedicated to doing IF,
//  one is dedicated to doing ID, and so on.  
//  Each stage does what the corresponding state did in the hardwired
//  processor.
//  The stages are: IF, ID, EX, MEM, WB.  The last is used to write
//  results back to the register file.
//  At any cycle there can be up to five instructions in the processor,
//  each in a different stage.
//  RISC ISAs (MIPS is one, IA-32 [80x86] is not) were designed to
//  facilitate pipelined implementations, which is why most (if not
//  all) MIPS implementations are pipelined.
//  This pipelining opens up a can of worms, which we'll have fun
//  playing with in EE 4720.
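
//  The way instructions advance through the stages can be sketched
//  with a chain of pipeline registers.  This is not the EE 4720
//  design: the module and signal names are assumptions, and only the
//  instruction word is carried along, with no datapath, stalls, or
//  branches.

```verilog
// Hypothetical skeleton: one instruction register per stage after IF.
module pipe_sketch
  (output reg [31:0] id_ir, ex_ir, mem_ir, wb_ir,
   input [31:0] if_ir,
   input clk, reset);

   always @( posedge clk )
     if ( reset ) begin
        // An all-zero word is a MIPS nop (sll r0, r0, 0).
        id_ir  <= 32'd0;
        ex_ir  <= 32'd0;
        mem_ir <= 32'd0;
        wb_ir  <= 32'd0;
     end else begin
        // Every cycle each instruction moves to the next stage.
        id_ir  <= if_ir;
        ex_ir  <= id_ir;
        mem_ir <= ex_ir;
        wb_ir  <= mem_ir;
     end
endmodule
```

//  With a new word arriving on if_ir each cycle, the five signals
//  hold five different instructions, one per stage.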

 /// Pipelined, Dynamically Scheduled
//  http://www.ece.lsu.edu/ee4720/v/mipspipeds.html
//  Some instructions take a long time to produce results, floating
//  point instructions and memory loads, for example. (This problem
//  was completely avoided in EE 3755 by omitting floating-point
//  instructions and having perfect memory.)
//  In a dynamically scheduled processor instructions are fetched and
//  decoded in program order, but they execute when their operands
//  become available, which might not be in program order.  An
//  instruction waiting for its operands (because a previous
//  instruction is taking a long time) does not prevent instructions
//  that follow it from being fetched, decoded, and executed.