EE 4720 Computer Architecture - HW 5 Solution (Spring 1998)
|
The execution is shown below. The first iteration takes
five cycles (5-0), the second iteration takes 7 (12-5).
The second iteration takes two cycles longer because
of the first branch's one-cycle delay and the true
dependency between the subi and bneq instructions.
| |
Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12
LOOP:
lw r1, 0(r2) IF ID EX ME WB IF ID EX ME WB IF
addi r2, r2, #4 IF ID EX ME WB IF ID EX ME WB IF
beqz r1, TARG IF ID ----> EX ME WB
IF ID ----> EX ME WB
IF
sub r4, r4, r1 IF ID ----> EX ME WB
IF ID ----X
IF
j LOOP IF ----> ID EX ME WB
IF ----X IF x
TARG:
subi r5, r5, #1 IF ----> ID x IF ----X IF ID EX ME WB
bnez r5, LOOP IF ----> ID x IF ----X IF ID -> EX ME WB
and r4, r4, r6 IF ----> ID x IF ----X IF ID -> x
or r4, r4, r7 IF x IF -> x
sw 0(r8), r4 IF x IF -> x
addi r8, r8, #4 IF x IF -> x
jr r31 IF x IF -> x
|
The loop in the program above is quite inefficient. There are many
bubbles, and many instructions are cancelled because of the branch
delay following the taken branches. When branching from an
instruction other than the last in an aligned group (bneq and beqz are
the third, j is the first) instructions following the CTI (branch or
jump) are cancelled. Finally, when branching to TARG, which is not the
first instruction in an aligned group, instructions before TARG are
cancelled. Clearly, this loop makes poor use of our 4-way
superscalar machine. How poor?
| |
|
An iteration of the loop takes 5 cycles when the branch is not taken,
and takes 7 cycles when taken. Either way, the loop has 5
instructions. These numbers can be used to compute the CPI of the
execution of a large number of iterations because they do not depend
on what happened in the previous iterations.
There can be inter-iteration hazards when long latency functional
units are used because the result of an operation started in one
iteration might be needed in the next, or the functional unit itself
might be needed. These problems do not occur here.
| |
|
The first branch in the loop is taken if the value read from memory
is zero. According to the problem statement 30% of those words are zero.
The CPI is found by taking a weighted average as shown:
| |
CPI = 0.3 ttakenICtaken + 0.7 tnot takenICnot taken
CPI = 0.3 7/5 + 0.7 5/5
CPI = 1.12
|
Keep in mind that the ideal CPI for a 4-way machine is 0.25.
| |
Branch Delay Slots
|
Branch delay slots would have very little effectiveness for the code
above. Branch delay slots might seem like a promising enhancement for
superscalar machines because a branch can have a large delay: 7
instructions if the branch is the first instruction in an aligned
group. With a 7-delay-slot branch these bubbles would be avoided.
Branch delay slots are only useful if they can be filled with
something other than nop's, in the code above (without major
re-writing) the only candidate for a branch delay slot is the addi
instruction. If the loop is unrolled there might be more
opportunities for filling delay slots. | |
Dynamic Scheduling
|
Dynamic scheduling would have very little effectiveness
for the code above, though it would be more useful on other code.
Dynamic scheduling helps avoid stalls due to name dependencies
and allows instructions to start out of order. In the code above
the stalls are due to true dependencies (between lw and addi, for example),
and control dependencies. Without branch prediction and speculative
execution there would be no benefit. | |
Branch Target Buffer
|
A branch target buffer would be moderately effective for
the code above, but its execution would still be well below
the maximum issue rate.
As stated above, the branch execution is slowed due to branches.
A branch target buffer would help by eliminating the branch delay
slot caused by the beqz, bnez, and j instructions, the overall
effectiveness would depend on the branch prediction accuracy.
The bneq branch would probably be correctly predicted because
it is only not taken on the last iteration. The behavior of the
beqz branch depends on the characteristics of the values read.
If the zeros are bunched (0,0,0,0,0,1,3,2) then prediction should
work well, if they're randomly interspersed the predictor won't work.
| |
Predicated Execution
|
Predicated execution would have limited effectiveness Predicated execution
can be used to eliminate one or two branches from the loop, thus
eliminating wasted issue slots from taken branches. The loop with one
branch removed might look like: | |
lw r1,0(r2) IF ID EX ME WB
addi r2,r2,#4 IF ID EX ME WB
set_c r1 IF ID ----> EX ME WB
sub_pn r4,r4,r1 IF ID ----> EX ME WB
bneq r1,LOOP IF ----> ID EX ME WB
subi_pz r5,r5,#1 IF ----> ID EX ME WB
bneq r5,LOOP IF ----> ID EX ME WB
|
The predicate bit is set by the set_c instruction, here that
value is being bypassed to the WB stage so there is no
additional stall for sub_pn.
The branch delay due to the removed branch is no longer
present. | |
Loop Unrolling
|
Loop unrolling can be very effective on the code above.
Loop unrolling is effective when several iterations can
be overlapped. One problem with loop unrolling is that the number
of iterations may not be the multiple of the degree of unrolling.
For example, the code above was unrolled 4 times, but the number of
iterations may not be a multiple of 4. The problem is easiest to fix
if the number of iterations is known at compile time: the compiler
would end the loop at the largest multiple smaller than the number
of iterations, and fix-up code following the unrolled loop would perform
the remaining few iterations. In the loop below, not only is the
number of iterations not known at compile time (presumably), the
number of iterations may not be known until the loop finishes its
last iteration.
(The number of iterations depends on the data fetched.) That's
still no problem as long as the it's okay to make unneeded memory
accesses.
| |
|
In the unrolling below the branch in the middle of the loop is
eliminated. The sub r4, r4, r1 can be performed whether
or not the branch is taken, and seqi and add
instructions are used to decrement r5 for four
elements at once. The unrolled loop
has only a single branch, iteration continues if r5 is
positive. When r5 will be zero or negative the loop
exits and fix-up code (not shown) re-loads words and adjusts
r4 . (It's possible that too many elements were
subtracted.) | |
|
In the unrolled loop values are fetched one iteration ahead of time, so
when it finishes at least four extra values will have been fetched.
| |
|
Using sle r30, r5, r25 the branch condition can be
determined one cycle earlier than if a sle r30, r5, r0
instruction were used after sub r5, r5, r25 . | |
|
There are no stalls to avoid RAW hazards, the only wasted issue
slots (bubbles) are due to the branch. An iteration of the unrolled
loop takes 6 cycles, for a CPI of 6/19=0.316. The CPI is not useful for
comparing with the original loop since the instruction counts are
different. A better metric would be cycles per element (an iteration
in the original loop). The original loop took a weighted average of
0.7 5 + 0.3 7 = 5.6 cycles per element. The unrolled loop processes
four elements per iteration, so it takes 6/4=1.5 cycles per element.
| |
lw r11, 0(r2)
lw r12, 4(r2)
lw r13, 8(r2)
lw r14, 12(r2)
ULOOP: Assume ULOOP aligned.
seqi r21, r11, #0 ! Check if elements are zero.
seqi r22, r12, #0 ! =1 if element is zero, 0 otherwise.
seqi r23, r13, #0 !
seqi r24, r14, #0 !
add r15, r11, r12 ! Sum of element values.
add r16, r13, r14
add r25, r21, r22 ! Count of zero elements.
add r26, r23, r24
add r15, r15, r16 ! Sum of element values.
add r25, r25, r26 ! Count of zero elements.
lw r11, 16(r2) ! For next iteration, if needed.
lw r12, 20(r2) ! For next iteration, if needed.
sle r30, r5, r25 ! Branch condition.
sub r5, r5, r25 ! Update r5.
lw r13, 24(r2) ! For next iteration, if needed.
lw r14, 28(r2) ! For next iteration, if needed.
addi r2, r2, #16
sub r4, r4, r15 ! Maybe over-subtracting. Fix-up code should fix.
bneq r30, ULOOP
! Fix-up code starts. (not shown)
! Need to re-load last four words and adjust r4 and r5
! Fix up code ends.
and r4, r4, r6
or r4, r4, r7
sw 0(r8), r4
addi r8, r8, #4
jr r31
|
Instructions can be rearranged so that all conditionally executed
instructions are in the first bundle. There is a dependency between
the second addi and and and
slt , so they must be placed in separate bundles. There
are no other instructions to put in the second and third bundles,
so they are padded with nops.
| |
bx r1, !r1, !r1
add r2, r3, r4
add r2, r3, r5
addi r7, r7, #1
sub r6, r6, r2
addi r8, r2, #12
nop
nop
slt r1, r8, r9
and r10, r8, rll
nop
nop