# EE 4720 Computer Architecture - HW 8 Solution

### Problem 1

The original program with line numbers added:

```1: add r1, r2, r3
2: sub r6, r7, r8
3: lw  r10, 0(r20)
5: sub r14, r1, r9
7: sub r16, r17, r18
9: sw  0(r20), r6
```
The instructions are rearranged and placed in groups of 3 for 3VDLX. The first three instructions can be, and are, grouped together because they don't use any values they produce. The next three, on lines 4, 5, and 6, cannot be grouped together because the value produced on line 5 is needed on line 6. The inefficient solution is to keep the instructions in the same order and use nops to avoid hazards, starting with the load latency:
```add r1, r2, r3;    sub r6, r7, r8;     lw  r10, 0(r20)
nop;               nop;                nop
add r11, r10, r12; sub r14, r1, r9;    nop
add r1, r14, r15;  sub r16, r17, r18;  add r19, r21, r22
sw  0(r20), r6     nop;                nop
```
The first set of nops is needed because of the load instruction. The nop in the third 3VDLX instruction is needed because of the true dependency between lines 5 and 6 in the original program. The nops in the last 3VDLX instruction would be needed if we had no further instructions (which is unlikely when the last instruction is not a CTI, but that's where the problem ends).

The instructions can be rearranged for efficient execution. The instructions at lines 5, 7 and 8 can be placed in the second 3VDLX instruction. The remaining can be placed in the third 3VDLX instruction yielding:

```add r1, r2, r3;    sub r6, r7, r8;    lw  r10, 0(r20)
sub r16, r17, r18; add r19, r21, r22; sub r14, r1, r9
```
Since the VLIW instructions execute as a unit, there is no need to distinguish between the separate parts of a pipeline stage, as is done for superscalar.

```add  sub  lw   IF  ID  EX  MEM WB
sub  add  sub      IF  ID  EX  MEM WB
```

### Problem 2

Each pipeline stage has room for three instructions, superscripts are used to distinguish the parts. To save space, M is used for the MEM stage.

The first three instructions can start executing without delay. Of the next three, only the one on line 5 can start, the add on line 4 must wait for the load and the add on line 6 must wait for the sub on line 5. With two instructions in ID the last group of three instructions are stalled at time 3.

```Time  0   1   2   3   4   5   6   7
add   IF1 ID1 EX1 M1  WB1
sub   IF2 ID2 EX2 M2  WB2
lw    IF3 ID3 EX3 M3  WB3
add       IF1 ID1     EX1 M1  WB1
sub       IF2 ID2 EX2 M2  WB2
add       IF3 ID3     EX3 M3  WB3
sub           IF1     ID1 EX1 M1  WB1
add           IF2     ID2 EX2 M2  WB2
sw            IF3     ID3 EX3 M3  WB3
```

### Problem 3

Reservation stations, as described in class, add an extra cycle of latency because of the complexity of using the CDB (in the WB "stage"). The reservation station numbers associated with the functional units are:

```Int   1  2
L/S   3  4
Int   5  6
L/S   7  8
Int   9 10
L/S  11 12
```
where Int refers to an integer execution unit and L/S refers to a load/store unit. Note that reservation stations are dedicated to functional units. If ready instructions were waiting in reservation stations 1 and 2, only one could start in a cycle, even if the other two integer execution units were free.

The pipeline notation is the same used in class, for example, 5:EX3 indicates that execution unit 3 is executing an instruction from reservation station 5.

Load and store instructions need an ALU to compute addresses, in this solution a load/store unit has its own ALU, that stage is indicated by AD.

Execution is given below:

```Time  0   1       2       3       4       5       6     7
sub   IF2 ID2   5:EX2   5:WB2
lw    IF3 ID3  11:AD3  11:M3   11:WB3
add       IF1     ID1   2:RS    2:RS    2:EX1   2:WB1
sub       IF2     ID2   6:RS    6:EX2   6:WB2
add       IF3     ID3   9:RS    9:RS    9:RS    9:EX3  9:WB3
sub               IF1     ID1   1:EX1   1:WB1