# EE 4720 Computer Architecture - HW 5 Solution (Spring 1998)

 Problem 1,  EE 4720 HW 5 Solution Top   Next

 The execution is shown below. The first iteration takes five cycles (5-0), the second iteration takes 7 (12-5). The second iteration takes two cycles longer because of the first branch's one-cycle delay and the true dependency between the subi and bneq instructions.
```Cycle             0  1  2  3  4  5  6  7  8  9  10 11 12
LOOP:
lw   r1, 0(r2)   IF ID EX ME WB IF ID EX ME WB	      IF
addi r2, r2, #4  IF ID EX ME WB IF ID EX ME WB	      IF
beqz r1, TARG    IF ID ----> EX ME WB
IF ID ----> EX ME WB
IF
sub  r4, r4, r1  IF ID ----> EX ME WB
IF ID ----X
IF
j    LOOP           IF ----> ID EX ME WB
IF ----X IF x
TARG:
subi r5, r5, #1     IF ----> ID x  IF ----X IF ID EX ME WB
bnez r5, LOOP       IF ----> ID x  IF ----X IF ID -> EX ME WB
and  r4, r4, r6     IF ----> ID x  IF ----X IF ID -> x
or   r4, r4, r7              IF x	        IF -> x
sw   0(r8), r4               IF x	        IF -> x
addi r8, r8, #4              IF x	        IF -> x
jr   r31                     IF x	        IF -> x
```

 Problem 2,  EE 4720 HW 5 Solution Top   Previous   Next

 The loop in the program above is quite inefficient. There are many bubbles, and many instructions are cancelled because of the branch delay following the taken branches. When branching from an instruction other than the last in an aligned group (bneq and beqz are the third, j is the first) instructions following the CTI (branch or jump) are cancelled. Finally, when branching to TARG, which is not the first instruction in an aligned group, instructions before TARG are cancelled. Clearly, this loop makes poor use of our 4-way superscalar machine. How poor?
 An iteration of the loop takes 5 cycles when the branch is not taken, and takes 7 cycles when taken. Either way, the loop has 5 instructions. These numbers can be used to compute the CPI of the execution of a large number of iterations because they do not depend on what happened in the previous iterations. There can be inter-iteration hazards when long latency functional units are used because the result of an operation started in one iteration might be needed in the next, or the functional unit itself might be needed. These problems do not occur here.
 The first branch in the loop is taken if the value read from memory is zero. According to the problem statement 30% of those words are zero. The CPI is found by taking a weighted average as shown:
CPI = 0.3 ttakenICtaken + 0.7 tnot takenICnot taken
CPI = 0.3 7/5 + 0.7 5/5
CPI = 1.12
 Keep in mind that the ideal CPI for a 4-way machine is 0.25.

 Problem 3,  EE 4720 HW 5 Solution Top   Previous   Next

#### Branch Delay Slots

 Branch delay slots would have very little effectiveness for the code above. Branch delay slots might seem like a promising enhancement for superscalar machines because a branch can have a large delay: 7 instructions if the branch is the first instruction in an aligned group. With a 7-delay-slot branch these bubbles would be avoided. Branch delay slots are only useful if they can be filled with something other than nop's, in the code above (without major re-writing) the only candidate for a branch delay slot is the addi instruction. If the loop is unrolled there might be more opportunities for filling delay slots.

#### Dynamic Scheduling

 Dynamic scheduling would have very little effectiveness for the code above, though it would be more useful on other code. Dynamic scheduling helps avoid stalls due to name dependencies and allows instructions to start out of order. In the code above the stalls are due to true dependencies (between lw and addi, for example), and control dependencies. Without branch prediction and speculative execution there would be no benefit.

#### Branch Target Buffer

 A branch target buffer would be moderately effective for the code above, but its execution would still be well below the maximum issue rate. As stated above, the branch execution is slowed due to branches. A branch target buffer would help by eliminating the branch delay slot caused by the beqz, bnez, and j instructions, the overall effectiveness would depend on the branch prediction accuracy. The bneq branch would probably be correctly predicted because it is only not taken on the last iteration. The behavior of the beqz branch depends on the characteristics of the values read. If the zeros are bunched (0,0,0,0,0,1,3,2) then prediction should work well, if they're randomly interspersed the predictor won't work.

#### Predicated Execution

 Predicated execution would have limited effectiveness Predicated execution can be used to eliminate one or two branches from the loop, thus eliminating wasted issue slots from taken branches. The loop with one branch removed might look like:
```lw r1,0(r2)      IF ID EX ME WB
addi r2,r2,#4	 IF ID EX ME WB
set_c r1	 IF ID ----> EX ME WB
sub_pn r4,r4,r1	 IF ID ----> EX ME WB
bneq r1,LOOP	    IF ----> ID EX ME WB
subi_pz r5,r5,#1    IF ----> ID EX ME WB
bneq r5,LOOP	    IF ----> ID EX ME WB
```
 The predicate bit is set by the set_c instruction, here that value is being bypassed to the WB stage so there is no additional stall for sub_pn. The branch delay due to the removed branch is no longer present.

#### Loop Unrolling

 Loop unrolling can be very effective on the code above. Loop unrolling is effective when several iterations can be overlapped. One problem with loop unrolling is that the number of iterations may not be the multiple of the degree of unrolling. For example, the code above was unrolled 4 times, but the number of iterations may not be a multiple of 4. The problem is easiest to fix if the number of iterations is known at compile time: the compiler would end the loop at the largest multiple smaller than the number of iterations, and fix-up code following the unrolled loop would perform the remaining few iterations. In the loop below, not only is the number of iterations not known at compile time (presumably), the number of iterations may not be known until the loop finishes its last iteration. (The number of iterations depends on the data fetched.) That's still no problem as long as the it's okay to make unneeded memory accesses.
 In the unrolling below the branch in the middle of the loop is eliminated. The `sub r4, r4, r1` can be performed whether or not the branch is taken, and `seqi` and `add` instructions are used to decrement `r5` for four elements at once. The unrolled loop has only a single branch, iteration continues if `r5` is positive. When `r5` will be zero or negative the loop exits and fix-up code (not shown) re-loads words and adjusts `r4`. (It's possible that too many elements were subtracted.)
 In the unrolled loop values are fetched one iteration ahead of time, so when it finishes at least four extra values will have been fetched.
 Using `sle r30, r5, r25` the branch condition can be determined one cycle earlier than if a `sle r30, r5, r0` instruction were used after `sub r5, r5, r25`.
 There are no stalls to avoid RAW hazards, the only wasted issue slots (bubbles) are due to the branch. An iteration of the unrolled loop takes 6 cycles, for a CPI of 6/19=0.316. The CPI is not useful for comparing with the original loop since the instruction counts are different. A better metric would be cycles per element (an iteration in the original loop). The original loop took a weighted average of 0.7 5 + 0.3 7 = 5.6 cycles per element. The unrolled loop processes four elements per iteration, so it takes 6/4=1.5 cycles per element.
``` lw   r11, 0(r2)
lw   r12, 4(r2)
lw   r13, 8(r2)
lw   r14, 12(r2)

ULOOP:  Assume ULOOP aligned.
seqi r21, r11, #0  ! Check if elements are zero.
seqi r22, r12, #0  ! =1 if element is zero, 0 otherwise.
seqi r23, r13, #0  !
seqi r24, r14, #0  !

add  r15, r11, r12 ! Sum of element values.
add  r25, r21, r22 ! Count of zero elements.

add  r15, r15, r16 ! Sum of element values.
add  r25, r25, r26 ! Count of zero elements.
lw   r11, 16(r2)   ! For next iteration, if needed.
lw   r12, 20(r2)   ! For next iteration, if needed.

sle  r30, r5, r25  ! Branch condition.
sub  r5,  r5, r25  ! Update r5.
lw   r13, 24(r2)   ! For next iteration, if needed.
lw   r14, 28(r2)   ! For next iteration, if needed.

sub  r4, r4, r15   ! Maybe over-subtracting. Fix-up code should fix.
bneq r30, ULOOP

! Fix-up code starts. (not shown)
! Need to re-load last four words and adjust r4 and r5

! Fix up code ends.

and  r4, r4, r6
or   r4, r4, r7
sw   0(r8), r4
jr   r31

```

 Problem 4,  EE 4720 HW 5 Solution Top   Previous

 Instructions can be rearranged so that all conditionally executed instructions are in the first bundle. There is a dependency between the second `addi` and `and` and `slt`, so they must be placed in separate bundles. There are no other instructions to put in the second and third bundles, so they are padded with nops.
```bx   r1, !r1, !r1

sub  r6, r6, r2