EE 4720 Computer Architecture - HW 8 Solution


Problem 1

The original program with line numbers added:

1: add r1, r2, r3
2: sub r6, r7, r8
3: lw  r10, 0(r20)
4: add r11, r10, r12
5: sub r14, r1, r9
6: add r1, r14, r15
7: sub r16, r17, r18
8: add r19, r21, r22
9: sw  0(r20), r6
The instructions are rearranged and placed in groups of 3 for 3VDLX. The first three instructions can be, and are, grouped together because they don't use any values they produce. The next three, on lines 4, 5, and 6, cannot be grouped together because the value produced on line 5 is needed on line 6. The inefficient solution is to keep the instructions in the same order and use nops to avoid hazards, starting with the load latency:
add r1, r2, r3;    sub r6, r7, r8;     lw  r10, 0(r20)
nop;               nop;                nop
add r11, r10, r12; sub r14, r1, r9;    nop
add r1, r14, r15;  sub r16, r17, r18;  add r19, r21, r22
sw  0(r20), r6     nop;                nop
The first set of nops is needed because of the load instruction. The nop in the third 3VDLX instruction is needed because of the true dependency between lines 5 and 6 in the original program. The nops in the last 3VDLX instruction would be needed if we had no further instructions (which is unlikely when the last instruction is not a CTI, but that's where the problem ends).

The instructions can be rearranged for efficient execution. The instructions at lines 5, 7 and 8 can be placed in the second 3VDLX instruction. The remaining can be placed in the third 3VDLX instruction yielding:

add r1, r2, r3;    sub r6, r7, r8;    lw  r10, 0(r20)
sub r16, r17, r18; add r19, r21, r22; sub r14, r1, r9
add r11, r10, r12; add r1, r14, r15;  sw  0(r20), r6
Since the VLIW instructions execute as a unit, there is no need to distinguish between the separate parts of a pipeline stage, as is done for superscalar.

add  sub  lw   IF  ID  EX  MEM WB
sub  add  sub      IF  ID  EX  MEM WB 
add  add  sw           IF  ID  EX  MEM WB 

Problem 2

Each pipeline stage has room for three instructions, superscripts are used to distinguish the parts. To save space, M is used for the MEM stage.

The first three instructions can start executing without delay. Of the next three, only the one on line 5 can start, the add on line 4 must wait for the load and the add on line 6 must wait for the sub on line 5. With two instructions in ID the last group of three instructions are stalled at time 3.

Time  0   1   2   3   4   5   6   7
add   IF1 ID1 EX1 M1  WB1                
sub   IF2 ID2 EX2 M2  WB2                
lw    IF3 ID3 EX3 M3  WB3                
add       IF1 ID1     EX1 M1  WB1        
sub       IF2 ID2 EX2 M2  WB2        
add       IF3 ID3     EX3 M3  WB3    
sub           IF1     ID1 EX1 M1  WB1
add           IF2     ID2 EX2 M2  WB2
sw            IF3     ID3 EX3 M3  WB3

Problem 3

Reservation stations, as described in class, add an extra cycle of latency because of the complexity of using the CDB (in the WB "stage"). The reservation station numbers associated with the functional units are:

Int   1  2
L/S   3  4
Int   5  6
L/S   7  8
Int   9 10
L/S  11 12
where Int refers to an integer execution unit and L/S refers to a load/store unit. Note that reservation stations are dedicated to functional units. If ready instructions were waiting in reservation stations 1 and 2, only one could start in a cycle, even if the other two integer execution units were free.

The pipeline notation is the same used in class, for example, 5:EX3 indicates that execution unit 3 is executing an instruction from reservation station 5.

Load and store instructions need an ALU to compute addresses, in this solution a load/store unit has its own ALU, that stage is indicated by AD.

Execution is given below:

Time  0   1       2       3       4       5       6     7
add   IF1 ID1   1:EX1   1:WB1                                    
sub   IF2 ID2   5:EX2   5:WB2                                    
lw    IF3 ID3  11:AD3  11:M3   11:WB3                            
add       IF1     ID1   2:RS    2:RS    2:EX1   2:WB1            
sub       IF2     ID2   6:RS    6:EX2   6:WB2                    
add       IF3     ID3   9:RS    9:RS    9:RS    9:EX3  9:WB3     
sub               IF1     ID1   1:EX1   1:WB1                    
add               IF2     ID2   5:EX2   5:WB2                    
sw                IF3     ID3  12:AD3  12:M3   12:WB3            

Problem 4

Memory devices, which have 16-bit entries, are paired to provide the 32-bit words needed. Since 4-byte words are being fetched, use address bits 2..19 to index (as the address for) the memory devices. Bits 20..22 of the address can be used to select the memory pair with the correct data. The solution should include a diagram.
ECE Home Page 4720 Home Page Up
David M. Koppelman - koppel@ee.lsu.edu
Modified 2 May 1997 18:34 (23:34 UTC)