In System B, with 1024 entries, the BHT is indexed with 10 bits of instruction address. In System A, with 512 entries, 9 bits are used.

In the program below, instructions are preceded by their addresses.
The `bnez`

branch will always be taken, the
`beqz`

branch will rarely be taken. Because of their
addresses, the bits used to index the BHT will be different in the two
branches on System A (bits 2..11) but the same on System B (bits
2..10).

0x0800 addi r1, r0, #9000 ! Number of iterations to do. loop: 0x0804 bnez r1, target ! We know it's always taken. target: 0x1000 subi r1, r1, #1 ! Decrement loop counter. 0x1004 beqz r1, done ! Taken only in the last iteration. 0x1008 j loop done: 0x100A jalr r31 ! Return (to caller not shown here).

In the program below, the `beqz`

branch is taken 25% of the
time. Since samples returned from the random function are independent
from call to call (as are the outcomes of a coin toss, roulette wheel
spin, etc.) the long-term history of a 4-bit counter is better than
the short-term history of a 2-bit counter. (No matter
how many consecutive times the branch is taken, the probability of it
being taken is still 25%.)

addi r4, r0, #9000 ! Number of iterations to do. add r30, r31, r0 ! Save return address. loop: subi r4, r4, #1 bnez r4, continue add r31, r30, r0 ! Place saved return address it r31. jalr r31 ! Return (to caller not shown here). continue: jal random ! 32-bit random number generator, returns val in r1. andi r2, r1, #3 ! Mask, leaving only two least-significant bits. beqz r2, loop ! Branch taken 25% of time. Better predictions on Sys B. j loop

The solution is to have the IF stage fetch one extra byte, in most case this will be the first byte of the next instruction. (This is easy to do with the system described in the problem.) With the length field contained in the extra byte, the complete instruction can be fetched (on time) in the next cycle.

Time 0 1 2 3 4 add IF ID EX MEM WB sub IF ID EX MEM WBConsider the computation of the address for the

`sub`

instruction.
The address must be ready by the beginning of cycle 2, so it must
be computed in cycle 1. In cycle 1, the address of the `add`

instruction is known, but without the modification described above,
the length of the `add`

would not be known until the end
of cycle 1, too late to compute the address needed for cycle 2. The
modified hardware always fetches an extra byte. So the first byte
of the `add`

instruction fetched in cycle 1 was fetched
in cycle 0 and clocked into the IF/ID latch. So at cycle 1 we add
the length of the `add`

instruction to its address, to
obtain the address of the `sub`

instruction. If I have time, I'll include an illustration.

Offset bits: 0..5; index bits: 6..15; tag bits: 16..31.

for( i=0; i<400; i++) total += a[i];Suppose 0x8000 were the address of the first element. Since an element is four bytes, the address of the second element would be 0x8004. The index and tag portions of the two addresses are the same so they are in the same block. Even if the first address misses, the second one will hit (assuming no intervening accesses). The sixteenth (i=15) element's address is 0x803c, the index and tag are the same as the others so they all share a block. For i=16, the address is 0x8040, the index here is different. The tags for all elements (up to i=399) are the same, so one address cannot push another out of the cache.

Since every address is accessed once and there are 16 elements per block, the hit ratio is 15/16 = .9375. (Note that 400 is divisible by 16.)

for( j=0; j<2; j++) for( i=0; i<400; i++) total += a[i];The hit ratio during the first iteration of the outer loop is 15/16, as in the previous example. In the second iteration, all accesses hit, for an overall hit ratio of (15+16)/(16+16) = 0.96875.

for( j=0; j<2; j++) for( i=0; i<400; i++) total += a[i] + a[ i + 0x08000000 ];The address for

`a[ i + 0x08000000 ]`

is in the same set as
`a[i]`

(their indices are the same but their tags are
different). Since the cache is 4-way there is room for four addresses
in a set, (and more than enough room for 2). If random replacement
were used it would be possible that the second element could replace
the first one even if there were no other addresses in
the set. With LRU replacement, this isn't a problem.
Because `a[ i + 0x08000000 ]`

and `a[i]`

behave identically and do not interfere, the hit ratio is the
same as the previous example, 0.96875.

for( j=0; j<2 j++) for( i=0; i<16; i++) total += a[ i * 0x00800000 ]; /* Wow, what a big array! */Note that it's i × 0x00800000. (Also note that with the multiplier used in the original practice exam, the array would not fit in a 32-bit address space. The multiplier above should work.) Each value of i results in an address with the same index but a different tag. Therefore, all 16 addresses fall into the same set. The fifth one to enter pushes the first one out. Tragically (from a performance point of view) this cache never hits. Because consecutive accesses are to far-away elements, there are no hits due to spatial locality (next element in same block), as there were in the previous loops. Since each element is removed before it's used again, there are no hits due to temporal locality (the second iteration of the outer loop).

With random replacement there is a small chance that a cache entry brought in in the first iteration will survive long enough to be accessed again in the second iteration.

The first loop iteration starts at cycle 4 and ends at cycle 23 (the time of the first instruction of the loop [ld] is in the IF stage), for a total of 19 cycles. The second iteration will be different.

Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 muld IF ID 5:A1 5:A2 5:A5 5:A4 5:WB addd IF ID 3:RS 3:RS 3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB muld IF ID 6:RS 6:RS 6:RS 6:M1 6:M2 6:M3 6:M4 6:WB 6:WB subd IF ID 4:RS 4:RS 4:RS 4:A1 4:A2 4:A3 4:A4 4:WB 4:WBAs can be seen below, the second iteration starts at cycle 23 and ends at 40, for a total of 17 cycles. The first iteration takes longer because of the instructions preceding the loop which have true dependencies with loop instructions and which are using both add-unit reservation stations when the first looploop:Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ld IF ID 2:EX 2:MI 2:WI IF ID subd IF ID 3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB ld IF ID 1:EX 1:MI 1:WB subd IF ID 4:RS 4:RS 4:A1 4:A2 4:A3 4:A4 4:WB ld IF ID 2:EX 2:MI 2:WB subd IF ID 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB subi IF ID 1:EX 1:MI 1:WB addi IF ID 2:EX 2:MI 2:WB bnez IF ID 1:EX 1:MI 1:WB Time 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

`subd`

instruction
executes.

Time 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ld IF ID 2:EX 2:MI 2:WI IF ID 2:EX subd IF ID 4:RS 4:RS 4:RS 4:A1 4:A2 4:A3 4:A4 4:WB IF ID ld IF ID 1:EX 1:MI 1:WI IF subd 4:A1 4:A2 4:A3 4:A4 4:WB IF ID 3:RS 3:RS 3:RS 3:RS 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB ld IF ID 2:EX 2:MI 2:WI subd 3:RS 3:A1 3:A2 3:A3 3:A4 3:WB IF ID 4:RS 4:RS 4:RS 4:RS 4:RS 4:A1 4:A2 subi ID 1:EX 1:MI 1:WI IF ID 1:EX 1:MI 1:WI addi IF ID 2:EX 2:MI 2:WI 2:WI IF ID 2:EX 2:MI 2:WI 2:WI bnez IF ID 1:EX 1:MI 1:WI IF ID 1:EX 1:MI 1:WI Time 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40With a large number of iterations, 17 cycles and 9 instructions per iteration, the CPI is 1.89

A disadvantage of small page sizes is that the overhead in managing them is higher. That overhead is incurred in larger page tables and either a larger TLB or more frequent TLB misses. Another disadvantage is that smaller page sizes can leave spatial locality unexploited.

`addi`

might complete before a
floating-point overflow were detected. Suppose, after handling the
exception, the handler returns to to the `addi`

instruction. Then `addi`

would execute a second time when
only one execution was intended. In this particular case the
`addi`

could be skipped (assuming the `div`

was
not to be re-started), but it's not possible to find a correct return
point when any subset of following instructions may have completed.
For example, if both the `addi`

and `subi`

complete but the mul does not, there would be no place for the handler
return to.

div f0, f1, f2 addi r1, r1, #1 mul f3, f0, f4 subi r2, r2, #2

The best number of delay slots to use is determined by implementation technology. The number of delay slots chosen by the ISA designers might work well with the contemporary technology, but will be less useful in future implementations. For example changing the number of pipeline stages might change the optimal number of delay slots. The number of delay slots is fixed in the ISA, so future implementations must support them, which means added complexity with no benefit.

David M. Koppelman - koppel@ee.lsu.edu | Modified 8 May 1997 15:43 (20:43 UTC) |