EE 4720 Computer Architecture - HW 8 Solution

Problem 1

The original program with line numbers added:

1: add r1, r2, r3
2: sub r6, r7, r8
3: lw  r10, 0(r20)
4: add r11, r10, r12
5: sub r14, r1, r9
6: add r1, r14, r15
7: sub r16, r17, r18
8: add r19, r21, r22
9: sw  0(r20), r6

The instructions are rearranged and placed in groups of 3 for 3VDLX. The first three instructions can be, and are, grouped together because they don't use any values they produce. The next three, on lines 4, 5, and 6, cannot be grouped together because the value produced on line 5 is needed on line 6. The inefficient solution is to keep the instructions in the same order and use nops to avoid hazards, starting with the load latency:

add r1, r2, r3;    sub r6, r7, r8;     lw  r10, 0(r20)
nop;               nop;                nop
add r11, r10, r12; sub r14, r1, r9;    nop
add r1, r14, r15;  sub r16, r17, r18;  add r19, r21, r22
sw  0(r20), r6     nop;                nop

The first set of nops is needed because of the load instruction. The nop in the third 3VDLX instruction is needed because of the true dependency between lines 5 and 6 in the original program. The nops in the last 3VDLX instruction would be needed if we had no further instructions (which is unlikely when the last instruction is not a CTI, but that's where the problem ends).

The instructions can be rearranged for efficient execution. The instructions at lines 5, 7 and 8 can be placed in the second 3VDLX instruction. The remaining can be placed in the third 3VDLX instruction yielding:

add r1, r2, r3;    sub r6, r7, r8;    lw  r10, 0(r20)
sub r16, r17, r18; add r19, r21, r22; sub r14, r1, r9
add r11, r10, r12; add r1, r14, r15;  sw  0(r20), r6

Since the VLIW instructions execute as a unit, there is no need to distinguish between the separate parts of a pipeline stage, as is done for superscalar.

add  sub  lw   IF  ID  EX  MEM WB
sub  add  sub      IF  ID  EX  MEM WB 
add  add  sw           IF  ID  EX  MEM WB

Problem 2

Each pipeline stage has room for three instructions, superscripts are used to distinguish the parts. To save space, M is used for the MEM stage.

The first three instructions can start executing without delay. Of the next three, only the one on line 5 can start, the add on line 4 must wait for the load and the add on line 6 must wait for the sub on line 5. With two instructions in ID the last group of three instructions are stalled at time 3.

Time  0   1   2   3   4   5   6   7
add   IF¹ ID¹ EX¹ M¹  WB¹                
sub   IF² ID² EX² M²  WB²                
lw    IF³ ID³ EX³ M³  WB³                
add       IF¹ ID¹     EX¹ M¹  WB¹        
sub       IF² ID² EX² M²  WB²        
add       IF³ ID³     EX³ M³  WB³    
sub           IF¹     ID¹ EX¹ M¹  WB¹
add           IF²     ID² EX² M²  WB²
sw            IF³     ID³ EX³ M³  WB³

Problem 3

Reservation stations, as described in class, add an extra cycle of latency because of the complexity of using the CDB (in the WB "stage"). The reservation station numbers associated with the functional units are:

Int   1  2
L/S   3  4
Int   5  6
L/S   7  8
Int   9 10
L/S  11 12

where Int refers to an integer execution unit and L/S refers to a load/store unit. Note that reservation stations are dedicated to functional units. If ready instructions were waiting in reservation stations 1 and 2, only one could start in a cycle, even if the other two integer execution units were free.

The pipeline notation is the same used in class, for example, 5:EX³ indicates that execution unit 3 is executing an instruction from reservation station 5.

Load and store instructions need an ALU to compute addresses, in this solution a load/store unit has its own ALU, that stage is indicated by AD.

Execution is given below:

Time  0   1       2       3       4       5       6     7
add   IF¹ ID¹   1:EX¹   1:WB¹                                    
sub   IF² ID²   5:EX²   5:WB²                                    
lw    IF³ ID³  11:AD³  11:M³   11:WB³                            
add       IF¹     ID¹   2:RS    2:RS    2:EX¹   2:WB¹            
sub       IF²     ID²   6:RS    6:EX²   6:WB²                    
add       IF³     ID³   9:RS    9:RS    9:RS    9:EX³  9:WB³     
sub               IF¹     ID¹   1:EX¹   1:WB¹                    
add               IF²     ID²   5:EX²   5:WB²                    
sw                IF³     ID³  12:AD³  12:M³   12:WB³

Problem 4

Memory devices, which have 16-bit entries, are paired to provide the 32-bit words needed. Since 4-byte words are being fetched, use address bits 2..19 to index (as the address for) the memory devices. Bits 20..22 of the address can be used to select the memory pair with the correct data. The solution should include a diagram.

David M. Koppelman - koppel@ee.lsu.edu

Modified 2 May 1997 18:34 (23:34 UTC)