EE 4720 1997 Practice Final Exam Solution

Problem 1a

System B has a smaller BHT, so one can write a program having two branch instructions that are mapped to distinct BHT entries on System A but the same entry on System B. On System B, with a single entry being used for two branch instructions, the program can be written so that predictions interfere.

In System B, with 1024 entries, the BHT is indexed with 10 bits of instruction address. In System A, with 512 entries, 9 bits are used.

In the program below, instructions are preceded by their addresses. The bnez branch will always be taken, the beqz branch will rarely be taken. Because of their addresses, the bits used to index the BHT will be different in the two branches on System A (bits 2..11) but the same on System B (bits 2..10).

 0x0800 addi r1, r0, #9000   ! Number of iterations to do.
 0x0804 bnez r1, target      ! We know it's always taken.

 0x1000 subi r1, r1, #1      ! Decrement loop counter.
 0x1004 beqz r1, done	     ! Taken only in the last iteration.
 0x1008 j loop
 0x100A jalr r31             ! Return (to caller not shown here).

Problem 1b

System B uses 4 bits for its counter while system A uses a two-bit 2-bit counter. For branches having steady long-term behavior, the smaller counter in System A can be "tricked" by an apparent short-term change in behavior.

In the program below, the beqz branch is taken 25% of the time. Since samples returned from the random function are independent from call to call (as are the outcomes of a coin toss, roulette wheel spin, etc.) the long-term history of a 4-bit counter is better than the short-term history of a 2-bit counter. (No matter how many consecutive times the branch is taken, the probability of it being taken is still 25%.)

 addi r4, r0, #9000 ! Number of iterations to do.
 add  r30, r31, r0  ! Save return address.
 subi r4, r4, #1
 bnez r4, continue 
 add  r31, r30, r0  ! Place saved return address it r31.                    
 jalr r31           ! Return (to caller not shown here).                    
 jal random         ! 32-bit random number generator, returns val in r1.    
 andi r2, r1, #3    ! Mask, leaving only two least-significant bits.        
 beqz r2, loop      ! Branch taken 25% of time. Better predictions on Sys B.
 j    loop

Problem 2a

The length field should be placed in the first byte of every instruction. Putting it in the last byte is not very helpful since if the location of the last byte could be determined the contents of the length field would not be needed. Other positions are unsuitable for similar reasons.

Problem 2b

Note, this problem is tricky. Be sure to check your solution by drawing a timing diagram showing the execution of several instructions.

The solution is to have the IF stage fetch one extra byte, in most case this will be the first byte of the next instruction. (This is easy to do with the system described in the problem.) With the length field contained in the extra byte, the complete instruction can be fetched (on time) in the next cycle.

Time  0	  1   2	  3   4
add	  IF  ID  EX  MEM WB
sub	      IF  ID  EX  MEM WB
Consider the computation of the address for the sub instruction. The address must be ready by the beginning of cycle 2, so it must be computed in cycle 1. In cycle 1, the address of the add instruction is known, but without the modification described above, the length of the add would not be known until the end of cycle 1, too late to compute the address needed for cycle 2. The modified hardware always fetches an extra byte. So the first byte of the add instruction fetched in cycle 1 was fetched in cycle 0 and clocked into the IF/ID latch. So at cycle 1 we add the length of the add instruction to its address, to obtain the address of the sub instruction.

If I have time, I'll include an illustration.

Problem 3a

Size of cache: 4 × 1024 × 64 = 22 + 210 + 26 = 218 = 262144.

Offset bits: 0..5; index bits: 6..15; tag bits: 16..31.

Problem 3b

Loop 1

 for( i=0; i<400; i++) total += a[i];
Suppose 0x8000 were the address of the first element. Since an element is four bytes, the address of the second element would be 0x8004. The index and tag portions of the two addresses are the same so they are in the same block. Even if the first address misses, the second one will hit (assuming no intervening accesses). The sixteenth (i=15) element's address is 0x803c, the index and tag are the same as the others so they all share a block. For i=16, the address is 0x8040, the index here is different. The tags for all elements (up to i=399) are the same, so one address cannot push another out of the cache.

Since every address is accessed once and there are 16 elements per block, the hit ratio is 15/16 = .9375. (Note that 400 is divisible by 16.)

Loop 2

for( j=0; j<2; j++)
 for( i=0; i<400; i++) total += a[i];
The hit ratio during the first iteration of the outer loop is 15/16, as in the previous example. In the second iteration, all accesses hit, for an overall hit ratio of (15+16)/(16+16) = 0.96875.

Loop 3

for( j=0; j<2; j++)
 for( i=0; i<400; i++) total += a[i] + a[ i + 0x08000000 ];
The address for a[ i + 0x08000000 ] is in the same set as a[i] (their indices are the same but their tags are different). Since the cache is 4-way there is room for four addresses in a set, (and more than enough room for 2). If random replacement were used it would be possible that the second element could replace the first one even if there were no other addresses in the set. With LRU replacement, this isn't a problem.

Because a[ i + 0x08000000 ] and a[i] behave identically and do not interfere, the hit ratio is the same as the previous example, 0.96875.

Loop 4

for( j=0; j<2 j++)
 for( i=0; i<16; i++) total += a[ i * 0x00800000 ]; /* Wow, what a big array! */
Note that it's i × 0x00800000. (Also note that with the multiplier used in the original practice exam, the array would not fit in a 32-bit address space. The multiplier above should work.) Each value of i results in an address with the same index but a different tag. Therefore, all 16 addresses fall into the same set. The fifth one to enter pushes the first one out. Tragically (from a performance point of view) this cache never hits. Because consecutive accesses are to far-away elements, there are no hits due to spatial locality (next element in same block), as there were in the previous loops. Since each element is removed before it's used again, there are no hits due to temporal locality (the second iteration of the outer loop).

With random replacement there is a small chance that a cache entry brought in in the first iteration will survive long enough to be accessed again in the second iteration.

Problem 4a

The loop below was taken from homework 7, problem 1. To solve the problem properly one had to know details provided in the homework assignment but omitted from the practice test.

The first loop iteration starts at cycle 4 and ends at cycle 23 (the time of the first instruction of the loop [ld] is in the IF stage), for a total of 19 cycles. The second iteration will be different.

Time 0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28
muld IF   ID   5:A1 5:A2 5:A5 5:A4 5:WB
addd      IF   ID   3:RS 3:RS 3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB
muld           IF   ID   6:RS 6:RS 6:RS 6:M1 6:M2 6:M3 6:M4 6:WB 6:WB
subd                IF   ID   4:RS 4:RS 4:RS 4:A1 4:A2 4:A3 4:A4 4:WB 4:WB
Time 0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28
ld                       IF   ID   2:EX 2:MI 2:WI                                                                       IF   ID
subd                          IF   ID                            3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB
ld                                 IF                            ID   1:EX 1:MI 1:WB
subd                                                             IF   ID   4:RS 4:RS                4:A1 4:A2 4:A3 4:A4 4:WB
ld                                                                    IF   ID   2:EX 2:MI 2:WB
subd                                                                       IF   ID                  3:RS                     3:A1 3:A2 3:A3 3:A4 3:WB
subi                                                                            IF                  ID   1:EX 1:MI 1:WB
addi                                                                                                IF   ID   2:EX 2:MI 2:WB
bnez                                                                                                     IF   ID        1:EX 1:MI 1:WB
Time 0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28
As can be seen below, the second iteration starts at cycle 23 and ends at 40, for a total of 17 cycles. The first iteration takes longer because of the instructions preceding the loop which have true dependencies with loop instructions and which are using both add-unit reservation stations when the first loop subd instruction executes.

Time  19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40
ld                        IF   ID   2:EX 2:MI 2:WI                                                   IF   ID   2:EX
subd                           IF   ID   4:RS 4:RS 4:RS 4:A1 4:A2 4:A3 4:A4 4:WB                          IF   ID
ld                                  IF   ID   1:EX 1:MI 1:WI                                                   IF
subd  4:A1 4:A2 4:A3 4:A4 4:WB           IF   ID        3:RS 3:RS 3:RS 3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB
ld                                            IF        ID   2:EX 2:MI 2:WI
subd  3:RS                     3:A1 3:A2 3:A3 3:A4 3:WB IF   ID                  4:RS 4:RS 4:RS 4:RS 4:RS 4:A1 4:A2
subi  ID   1:EX 1:MI 1:WI                                    IF                  ID   1:EX 1:MI 1:WI
addi  IF   ID   2:EX 2:MI 2:WI 2:WI                                              IF   ID   2:EX 2:MI 2:WI 2:WI
bnez       IF   ID        1:EX 1:MI 1:WI                                              IF   ID	     1:EX 1:MI 1:WI 
Time  19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40
With a large number of iterations, 17 cycles and 9 instructions per iteration, the CPI is 1.89

Problem 4b

On systems using small page sizes, programs accessing small pieces of data spread throughout the address space use real memory efficiently. (If the page size were large, say 1MB, then the program might, for example, use 1MB of real memory just for a 100-byte data structure.)

A disadvantage of small page sizes is that the overhead in managing them is higher. That overhead is incurred in larger page tables and either a larger TLB or more frequent TLB misses. Another disadvantage is that smaller page sizes can leave spatial locality unexploited.

Problem 4c

An instruction that is supposed to execute once may execute twice. In the example below, the addi might complete before a floating-point overflow were detected. Suppose, after handling the exception, the handler returns to to the addi instruction. Then addi would execute a second time when only one execution was intended. In this particular case the addi could be skipped (assuming the div was not to be re-started), but it's not possible to find a correct return point when any subset of following instructions may have completed. For example, if both the addi and subi complete but the mul does not, there would be no place for the handler return to.

 div f0, f1, f2
 addi r1, r1, #1
 mul f3, f0, f4
 subi r2, r2, #2

Problem 4d

Delay slots give the pipeline extra cycles to compute the target address. Since useful instructions can be placed in the delay slot, fewer cycles are wasted if the branch is taken.

The best number of delay slots to use is determined by implementation technology. The number of delay slots chosen by the ISA designers might work well with the contemporary technology, but will be less useful in future implementations. For example changing the number of pipeline stages might change the optimal number of delay slots. The number of delay slots is fixed in the ISA, so future implementations must support them, which means added complexity with no benefit.

ECE Home Page 4720 Home Page Up
David M. Koppelman -
Modified 8 May 1997 15:43 (20:43 UTC)