EE 4720 1997 Midterm Exam Solution
Problem 1
Part (a)
The modified pipe below implements a high-speed version of mems,
the modified portions are shown in red.

mems executes as follows:
0 1 2 3 4 19 20
mems IF ID EX1 MEM1 MEM1 ... MEM1 MEM1 MEM2 WB
EX2 EX2 EX2 EX3
As usual, the instruction is fetched in IF. In ID rs1 (e.g., r1) is
loaded into latch buffer A and rs2 (e.g., r2) is loaded into B. The
EX stage is used three different ways, EX1, EX2,
and EX3. In EX1 the address, rs2, is passed
through the ALU unchanged. The ID/EX latches are not clocked until
the cycle after the element is found, so that rs1 is held in the B
buffer. (rs2 is also held, but is not needed.) In MEM1 the
element is retrieved from memory while at the same time the address of
the next element is computed in EX2. Using an added
comparison unit, EX2 also compares the element retrieved in
the previous cycle with rs1. The mems instruction will continue using
the EX and MEM stages (inserting NOPs into WB and stalling
instructions in IF and ID), until the comparison unit signals a
match (cycle 19 in the example).
In the cycle following a match (20) the execute stage performs
EX3, subtracting 4 to compute the address of the element
found in the previous cycle. The address makes its way back to rs2 in
MEM2 and WB.
Part (b)
LOOP:
lw r3, 0(r2)
addi r2, r2, #4 ! Words are four bytes long.
seq r3, r3, r1 ! Compare. Note that lw stall avoided.
beqz r3, LOOP
subi r2, r2, #4 ! Saves a branch inside or before loop.
Part (c)
The mems instruction executes for three cycles before reaching
MEM1. MEM1 is used once for each element
fetched, plus 1, for a total of 1001 cycles. Two more cycles complete
execution for a total of 1006 cycles.
An iteration of the program above executes as follows (on
a implementation that includes bypass paths)
0 1 2 3 4 5 6 7
lw IF ID EX MEM WB IF
addi IF ID EX MEM WB
seq IF ID EX MEM WB
beqz IF ID EX MEM WB
Each iteration takes seven cycles, one additional cycle is needed
after the last iteration to complete the subtraction.
The total number of cycles is 7 × 1000 + 1 = 7001.
Advanced techniques, covered in chapter 4, will reduce mems advantage.
Problem 2
Part (a)
0 1 2 3 4
lw IF ID EX MEM WB
add IF ID EX WB
The hazard occurs in cycle 4.
Part (b)
The modifications (additions) are shown in red.

If the "=ARITH" comparison unit in the EX stage detects an arithmetic
instruction, the IR and ALU outputs are written to the MEM/WB latches,
skipping a stage. Note that this replaces whatever was in MEM, resulting
in incorrect execution if a WB structural hazard is present.
Part (c)
The structural hazard could be avoided by adding a second write
port to the register file (additional changes would have
to be made to the part (b) solution)
or by having an arithmetic instruction
execute normally if skipping the MEM stage would result in
a structural hazard.
Part (d)
The benefit of skipping the MEM stage is that registers are written
one cycle early. If register forwarding is not used then early
writeback reduces the number of stall cycles due to RAW hazards,
which is worthwhile. If register forwarding is used, then there is no
benefit to early writeback since results from any arithmetic
instruction are available to instructions immediately following.
Problem 3
Part (a)
An undefined instruction in an ISA might still do something useful in
an implementation of that ISA. (This might happen by accident.)
Suppose programmers discover such an instruction in a current
implementation and use it in their code. These programs would run
fine on the current implementation, but the computer engineers working
on the new implementation of the ISA may make no attempt to see that
undefined instructions execute the same way. (Why should they, the
instructions were undefined and programmers were warned not to use
them.) Therefore, computers using the new chip may not be able to run
the programs using the undefined instructions.
The problem is avoided if undefined instructions raise an illegal
instruction exception.
Part (b)
The target address of a trap instruction is found in a table using
an index provided in the instruction. The target address of
a subroutine call is given in the instruction, be it PC relative,
register indirect, immediate, etc.
The target address of a trap
instruction is typically in the system part of the address space while
subroutines are in the user part.
As a result of executing a trap, the processor switches to privileged
mode, while subroutine calls do not change the processor mode.
Part (c)
All instructions that read registers have the register numbers in the
same place, bits 6..10 and 11..15, so they can be fetched without
knowing the identity of the instruction. There is no harm in fetching
registers for an instruction that does not need them, so registers can
be fetched for any instruction and so no decoding is necessary.
Part (d)
Program weights in a geometric mean of normalized execution times (or
rates) do not depend on their run time on the base machine, so a
particular base machine will not exaggerate or diminish the impact of
any program in the suite.
Part (e)
The clock rate might change because of faster technology (e.g., the
new fabrication process has a smaller feature size) or because
engineers chose to make pipeline segments shorter, allowing for a
faster clock.
The instruction count might change in two implementations of the
same ISA because relative execution times for individual
instructions might change (although the instructions available
in the old and new implementations are the same), leading programmers to
re-write time-critical portions. For example, an integer add might be
replaced by a speeded-up floating-point add, perhaps saving
integer-to-floating-point and floating-point-to-integer conversion
steps.
CPI might change because the number of stall cycles was reduced
using advanced techniques, etc.