/// LSU EE 3755 -- Spring 2002 -- Computer Organization
/// Note Set 13 -- Our MIPS Implementations

// Time-stamp: <29 April 2002, 15:21:40 CDT, koppel@neptune>

/// Contents
// Class MIPS Implementations
// MIPS Functional Simulator
// Hardwired Control MIPS Implementation
// Microcoded Control MIPS Implementation
// Other Implementations

/// References
// :P:   Palnitkar, "Verilog HDL"
// :Q:   Qualis, "Verilog HDL Quick Reference Card Revision 1.0"
// :H:   Hyde, "Handbook on Verilog HDL"
// :LRM: IEEE, Verilog Language Reference Manual  (Hawaii Section Numbering)
// :PH:  Patterson & Hennessy, "Computer Organization & Design"
// :HP:  Hennessy & Patterson, "Computer Architecture: A Quantitative Approach"
// :Mv1: MIPS Technologies, "MIPS32 Architecture for Programmers Vol I: Intro"
// :Mv2: MIPS Technologies, "MIPS32 Architecture for Programmers Vol II: Instr"

/// Class MIPS Implementations

 /// WARNING: Links below to code will change in April and May 2002.

 /// Summary of Implementation Types

 //  All Verilog descriptions are synthesizable.
 //  They execute only a subset of MIPS instructions.

 /// Functional Simulator   (EE 3755)
//  http://www.ece.lsu.edu/3755/2002/mips_fs.html
//   Not intended for synthesis, but synthesizable nevertheless.
//   Illustrates basic implementation techniques.
//   Can be used to validate other models.

 /// Hardwired Control Implementation  (EE 3755)
//  http://www.ece.lsu.edu/3755/2002/mips_hc.html
//   Low cost, small size.
//   Processors implemented this way may be used where the very smallest
//   size is needed, for example as part of a complete system on a
//   chip used to control, say, a microwave oven.  In reality,
//   even low-cost MIPS implementations are pipelined, unlike this
//   implementation (see below).

 /// Microcoded Control Implementation  (EE 3755)
//  http://www.ece.lsu.edu/3755/2001f/mipsmc.html
//   Low cost.
//   Microcoding is usually applied to processors with more complex
//   instruction sets.

 /// Pipelined, Statically Scheduled  (EE 4720)
//  http://www.ece.lsu.edu/ee4720/v/mipspipeby.html
//   Similar to hardwired control, but much faster at little additional cost.
//   Many current processors are implemented this way, e.g., the UltraSPARC III.

 /// Pipelined, Dynamically Scheduled  (EE 4720)
//  http://www.ece.lsu.edu/ee4720/v/mipspipeds.html
//   Some additional speed, lots of additional complexity.
//   Many current processors are implemented this way, e.g., the Pentium 4
//   and the MIPS R10000.

/// MIPS Functional Simulator

 //  http://www.ece.lsu.edu/3755/2002/mips_fs.html

 ///  As simple as possible.

 ///  Uses
//    Illustrate basics of implementation to students.
//    Verify more complex implementations.

 ///  Key Characteristics
 ///   Each instruction executed in a single cycle.
//      Unrealistic because it precludes hardware sharing, for example,
//      using the ALU to compute a branch condition and target.

 ///   Memory is within processor. 
//      Unrealistic because memory is usually too large to fit on the
//      same chip as the CPU.
//      Even if some memory is on-chip, an interface to that memory
//      (addr, we, etc.) is usually needed for external communication.
//      Even if an external interface is present, we might not trust
//      the synthesis program to properly infer a memory from our
//      "casual" use of the mem array.

 ///   Almost no thought given to synthesized hardware.
//      Synthesis programs cannot yet be trusted to do a good job.

//  The problems above are only relevant if the description is synthesized.

//  If simulation is our goal then the MIPS functional simulator
//  implementation is fine because of its simplicity.  For simulation
//  a description should be as simple as possible to minimize the
//  chance of errors.
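//
//  A minimal sketch of the style described above, not the class code:
//  one instruction per cycle, with memory inside the module.  The
//  module name fs_sketch, the memory size, and the two-instruction
//  subset (add and addi, without overflow exceptions) are assumptions
//  chosen only for illustration.

module fs_sketch(input wire clk, input wire reset);

   reg [31:0] pc;
   reg [31:0] gpr [0:31];
   reg [31:0] mem [0:255];   // Memory within processor, indexed by word.

   reg [31:0] ir;
   reg [5:0]  opcode, func;
   reg [4:0]  rs, rt, rd, sa;
   reg [15:0] immed;

   always @( posedge clk )
     if ( reset ) begin

        pc = 0;
        gpr[0] = 0;            // Register zero always reads as zero.

     end else begin

        // Fetch, decode, and execute, all in the same cycle.
        ir = mem[ pc[9:2] ];
        {opcode,rs,rt,rd,sa,func} = ir;
        immed = ir[15:0];

        case ( opcode )
          6'h00: if ( func == 6'h20 && rd != 0 )        // add
                   gpr[rd] = gpr[rs] + gpr[rt];
          6'h08: if ( rt != 0 )                         // addi
                   gpr[rt] = gpr[rs] + { {16{immed[15]}}, immed };
          default: ;                                    // Others ignored.
        endcase

        pc = pc + 4;

     end

endmodule

//  Note that the two addition operators above, plus the one used for
//  pc, could each be synthesized as a separate adder.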

/// Hardwired Control MIPS Implementation

 //  http://www.ece.lsu.edu/3755/2002/mips_hc.html

 ///  Low Cost Implementation

 ///  Key Characteristics
 ///   Memory is outside the CPU.
//    This is the way it should be.

 ///   The ALU is a separate module.
//    In the functional simulator a synthesis program might have
//    synthesized an adder each place an addition operator is used, a
//    bitwise AND for each AND, etc.  The Hardwired code instantiates
//    one ALU, so that's all the synthesis program will create.
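//
//    A sketch of the idea, assuming a hypothetical four-operation
//    ALU.  The module name alu_sketch, the port names, and the op
//    encoding are illustrative and are not taken from the class code.

module alu_sketch(output reg [31:0] result,
                  input wire [31:0] a, b,
                  input wire [1:0] op);

   localparam [1:0] OP_ADD = 0, OP_SUB = 1, OP_AND = 2, OP_OR = 3;

   // Combinational logic: each operator appears exactly once.
   always @( a or b or op )
     case ( op )
       OP_ADD: result = a + b;
       OP_SUB: result = a - b;
       OP_AND: result = a & b;
       OP_OR:  result = a | b;
     endcase

endmodule

//    The control code would then instantiate it once and steer
//    operands to it with multiplexors, for example:
//      alu_sketch our_alu(alu_out, alu_a, alu_b, alu_op);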

 ///   Each instruction is executed in multiple cycles.

 ///  Further Refinements That Might Be Made
 ///   Separate out additional modules.
//    In real systems arithmetic and shifting are done by separate
//    units because shifting logic cannot be shared with arithmetic
//    or logic.
//    Dividing parts of the design into separate modules allows them
//    to be developed concurrently and makes tuning of cost (gates)
//    and performance (cycle time) easier.
//    Humans know the GPR is a memory with two read ports and a write
//    port.  The synthesis program might not realize it (perhaps due
//    to bad Verilog coding style) and instead provide more write ports
//    than necessary, or it might not realize it can use the
//    streamlined memory cell provided in the target technology
//    library.  Having a separate GPR module can avoid these problems.
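//
//    A separate GPR module might look something like the sketch
//    below.  The module name gpr_sketch and its port names are
//    assumptions; whether a particular synthesis program maps it to a
//    register-file cell depends on the program and the library.

module gpr_sketch(output wire [31:0] rs_val, rt_val,
                  input wire [4:0] rs, rt, rd,
                  input wire [31:0] rd_val,
                  input wire we, clk);

   reg [31:0] regs [0:31];

   // Two read ports.  Register zero always reads as zero.
   assign rs_val = rs == 0 ? 32'd0 : regs[rs];
   assign rt_val = rt == 0 ? 32'd0 : regs[rt];

   // One write port.
   always @( posedge clk ) if ( we && rd != 0 ) regs[rd] <= rd_val;

endmodule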
 ///   Perform some limited pipelining.
//     The fetch of one instruction can be overlapped with the
//     execution of the previous one.  See Fall 2001 Homework 7
//     Problem 2.

 ///   Refine the ALU design.
//     Will the synthesis program efficiently combine the different
//     operations in the ALU module?  If not, re-do the alu with a
//     lower level design, something like alu2 in l05.v (Section 4.5
//     of the text).

 ///   Move code for simmed, etc. out of the procedural block to simplify hardware.
//     The synthesis program might create more registers and other
//     logic than necessary.  For example, the assignment
       rs_val = gpr[rs];
//     in the ID state might force the synthesis program to use a
//     register for rs_val (if it were accessed outside the ID state).
//     Making that a continuous assignment outside of procedural code
//     would solve the problem: 
       wire [31:0] rs_val = gpr[rs]; 
//     Another candidate for de-proceduralization is:
       {opcode,rs,rt,rd,sa,func} = ir; 
//     since ir doesn't change while an instruction is being decoded
//     and executed.
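//
//     Written as a self-contained module, the de-proceduralized
//     field extraction might look like the sketch below.  The module
//     name decode_sketch is hypothetical; the field widths
//     (6+5+5+5+5+6 = 32 bits) follow the MIPS R-format.

module decode_sketch(input wire [31:0] ir,
                     output wire [5:0] opcode, func,
                     output wire [4:0] rs, rt, rd, sa);

   // Continuous assignment: just wires, no registers are inferred.
   assign {opcode,rs,rt,rd,sa,func} = ir;

endmodule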

 ///   Check timing and move hardware between states.
//     For example, "move" the alu input multiplexors from ID to the
//     EX state.  The ID state would compute an alu control input,
//     like alu_a_src in the microcoded design.

 ///   Use the ALU to compute NPC + 4.

 ///   Add the remaining MIPS instructions!

/// Microcoded Control MIPS Implementation

//  http://www.ece.lsu.edu/3755/2001f/mipsmc.html

 ///  Different Type of Control Logic

//  In the Hardwired Control Implementation the control signals (the
//  control inputs to the multiplexors and the enable inputs to the
//  registers) are output by combinational logic.  (The combinational
//  logic is generated from the case expressions in case statements
//  and the condition in if statements.  To reduce the number of
//  multiplexor inputs, optimization might make this logic quite
//  complex.)

//  In the Microcoded Control Implementation the control signals are
//  generated by a small computer, called the microcontroller.

//  Microcode is used in older mainframe-class systems.
//  Microprocessors might use microcode, or something like it, for
//  only a few instructions.

//  A microcoded MIPS implementation is very unlikely.

 ///  Key Characteristics
 ///    Control signals generated by microcontroller.
 ///    Control logic is very simple; the complexity is in the microprogram ROM.
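//
//  A sketch of how such a microcontroller might be organized,
//  assuming a hypothetical four-entry microprogram.  The module name
//  ucontrol_sketch, the microinstruction format, and the control
//  signal names ir_we, gpr_we, and alu_src_imm are illustrative and
//  are not taken from the class code.  The ROM contents are set in an
//  initial block for simulation; synthesis support for that varies.

module ucontrol_sketch(input wire clk, reset,
                       input wire [5:0] opcode,
                       output wire ir_we, gpr_we, alu_src_imm);

   reg [3:0] upc;                 // Micro program counter.
   reg [7:0] urom [0:15];         // Microprogram ROM.

   // Microinstruction format:
   //   {next_upc[3:0], dispatch, alu_src_imm, gpr_we, ir_we}
   wire [7:0] uinsn    = urom[upc];
   wire [3:0] next_upc = uinsn[7:4];
   wire       dispatch = uinsn[3];
   assign     {alu_src_imm, gpr_we, ir_we} = uinsn[2:0];

   initial begin
      urom[0] = 8'b0001_0_0_0_1;  // IF: Write ir, go to 1.
      urom[1] = 8'b0000_1_0_0_0;  // ID: Dispatch on opcode.
      urom[2] = 8'b0000_0_0_1_0;  // EX, register operands: write gpr, go to 0.
      urom[3] = 8'b0000_0_1_1_0;  // EX, immediate operand: write gpr, go to 0.
   end

   // The only control logic: choose the next microinstruction address.
   always @( posedge clk )
     if ( reset )         upc <= 0;
     else if ( dispatch ) upc <= opcode == 6'h00 ? 4'd2 : 4'd3;
     else                 upc <= next_upc;

endmodule

//  Supporting another instruction in this sketch means adding ROM
//  entries and a dispatch target, not redesigning the control logic.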

 ///  Further Refinements That Might Be Made

 ///    Make part of micro ROM writable.
//      This would allow new instructions to be added after
//      the part is manufactured.  (And yes, if it were writable
//      it probably shouldn't be called a ROM anymore.)

 ///    Implement all instructions!

/// Other Implementations


 /// Implementation Techniques Covered in EE 4720

 /// Pipelined, Statically Scheduled
// http://www.ece.lsu.edu/ee4720/v/mipspipeby.html
//  Overlap execution states, so the ID for one instruction
//  is done at the same time as IF for the next one.
//  "States" for a pipelined processor are carefully chosen to
//  allow overlap.
//  One section of hardware, called a stage, is dedicated to doing IF,
//  one is dedicated to doing ID, and so on.  
//  Each stage does what the corresponding state did in the hardwired
//  processor.
//  The stages are: IF, ID, EX, MEM, WB.  The last is used to write
//  back the register file.
//  At any cycle there can be up to five instructions in the processor,
//  each in a different stage.
//  RISC ISAs (MIPS is one, IA-32 [80x86] is not) were designed to
//  facilitate pipelined implementations, which is why most (if not
//  all) MIPS implementations are pipelined.
//  This pipelining opens up a can of worms, which we'll have fun
//  playing with in EE 4720.
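//
//  The overlap can be sketched with the registers that sit between
//  stages.  The module below is only an illustration, not the EE 4720
//  code: it moves instruction words from stage to stage and does no
//  decoding or execution.  The module name pipe_sketch, the input
//  fetched_insn, and the register names are assumptions.

module pipe_sketch(input wire clk, reset,
                   input wire [31:0] fetched_insn);  // Instruction in IF.

   reg [31:0] if_id_ir;    // Instruction now in ID.
   reg [31:0] id_ex_ir;    // Instruction now in EX.
   reg [31:0] ex_mem_ir;   // Instruction now in MEM.
   reg [31:0] mem_wb_ir;   // Instruction now in WB.

   always @( posedge clk )
     if ( reset ) begin
        if_id_ir  <= 0;
        id_ex_ir  <= 0;
        ex_mem_ir <= 0;
        mem_wb_ir <= 0;
     end else begin
        // Every cycle each instruction advances one stage, so up to
        // five instructions are in the processor at once.
        if_id_ir  <= fetched_insn;
        id_ex_ir  <= if_id_ir;
        ex_mem_ir <= id_ex_ir;
        mem_wb_ir <= ex_mem_ir;
     end

endmodule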

 /// Pipelined, Dynamically Scheduled
// http://www.ece.lsu.edu/ee4720/v/mipspipeds.html
//  Some instructions, such as floating-point instructions and memory
//  loads, take a long time to produce results. (This problem
//  was completely avoided in EE 3755 by omitting floating-point
//  instructions and having perfect memory.)
//  In a dynamically scheduled processor, instructions are fetched and
//  decoded in program order, but they execute when their operands
//  become available, which might not be in program order.  An
//  instruction waiting for its operands (because a previous
//  instruction is taking a long time) does not prevent instructions
//  that follow it from being fetched, decoded, and executed.