EE 4720 Computer Architecture - HW 4 Solution (Spring 1998)

The most informative notation would indicate both the hardware an
instruction is in and how far along in execution it is. (The existing
notation shows only how far along execution is, which unambiguously
identifies the hardware only when the initiation interval is one.)
In one possible solution the location is indicated with a three-part
label. The first part shows the functional unit using a capital letter
(A for add, etc.). The second part, in parentheses, shows which segment
of the functional unit the instruction is in. The third part shows the
execution step. For example, A(2)2 shows an instruction in the
second adder segment, which is also the second step of execution.
(When the initiation interval is one, the segment number and the step
number are always the same.) As another example, D(1)20 shows
an instruction in the one-and-only divide-unit segment, in the twentieth
step of execution.
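
As an illustration only (it is not part of the original solution), a label in
this notation can be taken apart mechanically. The sketch below assumes every
label has the form capital letter, parenthesized segment number, step number;
the names StageLabel and parse_label are made up for this example.

```python
import re
from typing import NamedTuple


class StageLabel(NamedTuple):
    unit: str      # functional unit letter, e.g. "A" for add
    segment: int   # which segment of the functional unit
    step: int      # how many execution steps the instruction has reached


# Hypothetical pattern for labels such as "A(2)2" or "D(1)20".
_LABEL = re.compile(r"([A-Z])\((\d+)\)(\d+)")


def parse_label(label: str) -> StageLabel:
    """Split a three-part label into unit, segment, and step."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a three-part label: {label!r}")
    return StageLabel(match.group(1), int(match.group(2)), int(match.group(3)))


print(parse_label("A(2)2"))   # StageLabel(unit='A', segment=2, step=2)
print(parse_label("D(1)20"))  # StageLabel(unit='D', segment=1, step=20)
```

Keeping segment and step as separate fields mirrors the point made above: the
two numbers agree only when the initiation interval is one.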

The notation is used in the pipeline execution below, which runs on
the implementation described in problem 2.
div f11, f4, f5   IF    ID    D(1)1 D(1)2 D(1)3 D(1)4 D(1)5 D(1)6 D(1)7 D(1)8 D(1)9
mul f0, f1, f2          IF    ID    M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
mul f3, f6, f7                IF    ID    ----> M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
sub f8, f9, f10                     IF    ----> ID    A(1)1 A(2)2 A(3)3 -->   MEM   WB

The problem did not specify how WAW hazards are to be resolved.
They could be resolved by stalling the second multiply so that it writes
after the divide, or by canceling the divide when the multiply is
in either ID or WB. (The divide can be canceled because no
instruction reads the value it produces.) The second approach is
used here since it does not stall following instructions. The execution
diagram appears below, using the notation from part 1.
Time              0     1     2     3     4     5     6     7     8     9     10    11    12
div f3, f4, f5    IF    ID    D(1)1 D(1)2 D(1)3 D(1)4 D(1)5 D(1)6 D(1)7 D(1)8 x
mul f0, f1, f2          IF    ID    M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
mul f3, f6, f7                IF    ID    ----> M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
sub f8, f9, f10                     IF    ----> ID    A(1)1 A(2)2 A(3)3 -->   MEM   WB
mul f11, f0, f12                                IF    ID    M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB

The execution above has two stalls: one in cycle 4, due to the
multiply-unit structural hazard, and one in cycle 9, due to the
memory-stage structural hazard.
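
The two stalls can be checked mechanically. The sketch below is an illustration
only: it assumes each stage label names one piece of hardware, that a stalled
instruction keeps occupying the stage it was last in, and that the starting
cycles are as read off the diagram above (0, 1, 2, 3 and 5). The helper names
are made up for this example.

```python
from collections import defaultdict

STALLS = {"---->", "-->"}  # stall markers used in the diagrams


def resource_of(token):
    """Map a stage label to the hardware it occupies, e.g. 'M(2)3' -> 'M(2)'."""
    return token.rstrip("0123456789") if "(" in token else token


def occupancy(start, tokens):
    """Yield (cycle, resource) pairs for one instruction."""
    held = None
    for cycle, token in enumerate(tokens, start):
        if token == "x":          # instruction canceled, nothing more to occupy
            return
        if token not in STALLS:   # on a stall the previous stage stays occupied
            held = resource_of(token)
        yield cycle, held


def conflicts(schedule):
    """Return {(cycle, resource): [instructions]} for doubly used hardware."""
    users = defaultdict(list)
    for name, (start, tokens) in schedule.items():
        for slot in occupancy(start, tokens):
            users[slot].append(name)
    return {slot: names for slot, names in users.items() if len(names) > 1}


MUL = "M(1)1 M(1)2 M(2)3 M(2)4 MEM WB"
schedule = {
    "div f3": (0, "IF ID D(1)1 D(1)2 D(1)3 D(1)4 D(1)5 D(1)6 D(1)7 D(1)8 x".split()),
    "mul f0": (1, f"IF ID {MUL}".split()),
    "mul f3": (2, f"IF ID ----> {MUL}".split()),
    "sub f8": (3, "IF ----> ID A(1)1 A(2)2 A(3)3 --> MEM WB".split()),
    "mul f11": (5, f"IF ID {MUL}".split()),
}

# Same execution, but with the cycle-9 stall of the subtract removed.
no_mem_stall = dict(schedule)
no_mem_stall["sub f8"] = (3, "IF ----> ID A(1)1 A(2)2 A(3)3 MEM WB".split())

print(conflicts(schedule))              # {}
print(sorted(conflicts(no_mem_stall)))  # [(9, 'MEM'), (10, 'WB')]
```

Without the cycle-9 stall the subtract would share MEM (and then WB) with the
second multiply, which is the memory-stage hazard named above.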

As with the previous problem, WAW hazards are handled by
canceling the first instruction that writes a register when the
second instruction that writes that register is in the WB stage.
Time              0     1     2     3     4     5     6     7     8     9     10    11    12    13
div f3, f4, f5    IF    ID    D(1)1 D(1)2 D(1)3 D(1)4 D(1)5 D(1)6 D(1)7 D(1)8 x
mul f0, f1, f2          IF    ID    M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
mul f3, f6, f7                IF    ID    ----> M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB
sub f8, f9, f10                     IF    ----> ID    ----> A(1)1 A(2)2 A(3)3 MEM   WB
mul f11, f0, f12                                IF    ----> ID    M(1)1 M(1)2 M(2)3 M(2)4 MEM   WB

Notice that the ID-stage stall delays the third multiply by one cycle.

In the execution below, the integer unit uses reservation
stations 3 and 4. Branch instructions are shown stopping after ID: they
do nothing useful after that stage, so no reservation station is shown
for them. (If the branch had to wait for r1, it would sit in a
reservation station, which would be shown.) The execution is shown
until cycle 25, two cycles after the second multiply's writeback.
Time  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25
addi  IF   ID   3:EX 3:WB
LOOP:
ld         IF   ID   4:EX 4:ME 4:WB IF   ID   4:EX 4:ME 4:WB IF   ID   4:EX 4:ME 4:WB IF   ID   4:EX 4:ME 4:WB                     IF
subi            IF   ID   3:EX ---> 3:WB IF   ID   3:EX ---> 3:WB IF   ID   3:EX ---> 3:WB IF   ID   3:EX ---> 3:WB
mul                  IF   ID   1:M1 1:M2 1:M3 1:M4 1:M5 1:M6 1:M7 1:M8 1:M9 1:WB
                                              IF   ID   2:RS 2:RS 2:RS 2:RS 2:M1 2:M2 2:M3 2:M4 2:M5 2:M6 2:M7 2:M8 2:M9 2:WB
                                                                       IF   ID   1:RS 1:RS 1:RS 1:RS 1:RS 1:RS 1:RS 1:RS 1:M1 1:M2 1:M3
                                                                                                IF   ID   ------------------> 2:RS 2:RS
bneq                      IF   ID                  IF   ID                  IF   ID                  IF   ------------------> ID
I1                             IF                       IF                       IF                                           IF

Before the reservation stations run out, each iteration of the loop above
takes five cycles; after they run out, each iteration takes nine cycles. The
reservation stations allow the "integer" part of the loop to get several cycles
ahead of the floating-point part.
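
The change from five to nine cycles per iteration can be reproduced with a small
model. This is a sketch under assumptions read off the execution above, not a
description of the actual hardware: two multiply reservation stations, nine
execute cycles per multiply, each multiply waiting for the previous one's
result, a dependent multiply starting in the cycle its predecessor writes back,
and a station becoming reusable the cycle after writeback. The function name
and parameters are made up for this example.

```python
def simulate_multiplies(count, stations=2, exec_cycles=9, loop_cycles=5, first_id=4):
    """Return (id_cycles, start_cycles, wb_cycles) for a chain of dependent multiplies."""
    ids, starts, wbs = [], [], []
    id_cycle = first_id
    for k in range(count):
        leave = id_cycle + 1                      # earliest cycle out of ID
        if k >= stations:                         # wait for a station to be freed
            leave = max(leave, wbs[k - stations] + 1)
        # The operand comes from the previous multiply's writeback.
        start = leave if k == 0 else max(leave, wbs[k - 1])
        ids.append(id_cycle)
        starts.append(start)
        wbs.append(start + exec_cycles)
        stall = leave - (id_cycle + 1)            # cycles spent stalled in ID
        id_cycle += loop_cycles + stall           # a stall delays the next iteration too
    return ids, starts, wbs


ids, starts, wbs = simulate_multiplies(6)
gaps = [later - earlier for earlier, later in zip(ids, ids[1:])]
print(ids)   # [4, 9, 14, 19, 28, 37]
print(gaps)  # [5, 5, 5, 9, 9]
```

Under these assumptions the first four multiplies reach ID five cycles apart;
the fourth finds no free station, and from then on the spacing is nine cycles.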

In the solution to the previous problem, the multiply instruction is
in ID for the second time in cycle 9. It can move into the
execute stage only if the result from the previous iteration is
ready. Since the first multiply is in ID in cycle 4, the multiply unit
would have to produce a result in 5 (or fewer) cycles to avoid
delaying the second multiply. If the execution of a multiply is
delayed by any amount, the delay accumulates from one iteration to the
next and all reservation stations will eventually be used up. A multiply
unit that produces a result in 5 cycles has a latency of 4.
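
The threshold of 5 cycles can be illustrated numerically. The sketch below
assumes an unlimited number of reservation stations, one multiply reaching ID
every five cycles starting in cycle 4, and each multiply needing the previous
one's result; wait_cycles is a name made up for this example.

```python
def wait_cycles(exec_cycles, iterations, loop_cycles=5, first_id=4):
    """Cycles each multiply waits after ID before it can start executing."""
    waits = []
    prev_done = 0                                   # cycle the previous result is ready
    for k in range(iterations):
        ready = first_id + k * loop_cycles + 1      # first cycle after ID
        start = max(ready, prev_done)               # wait for the previous result
        waits.append(start - ready)
        prev_done = start + exec_cycles
    return waits


print(wait_cycles(5, 4))  # [0, 0, 0, 0]
print(wait_cycles(6, 4))  # [0, 1, 2, 3]
print(wait_cycles(9, 3))  # [0, 4, 8]
```

With 5 cycles no multiply ever waits. With 6 the wait grows by one cycle per
iteration, so any finite number of stations eventually fills. The 9-cycle case
gives the 4 and 8 cycles that the second and third multiplies spend in their
reservation stations in the previous problem.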