1: add r1, r2, r3 2: sub r6, r7, r8 3: lw r10, 0(r20) 4: add r11, r10, r12 5: sub r14, r1, r9 6: add r1, r14, r15 7: sub r16, r17, r18 8: add r19, r21, r22 9: sw 0(r20), r6The instructions are rearranged and placed in groups of 3 for 3VDLX. The first three instructions can be, and are, grouped together because they don't use any values they produce. The next three, on lines 4, 5, and 6, cannot be grouped together because the value produced on line 5 is needed on line 6. The inefficient solution is to keep the instructions in the same order and use nops to avoid hazards, starting with the load latency:
add r1, r2, r3; sub r6, r7, r8; lw r10, 0(r20) nop; nop; nop add r11, r10, r12; sub r14, r1, r9; nop add r1, r14, r15; sub r16, r17, r18; add r19, r21, r22 sw 0(r20), r6 nop; nopThe first set of nops is needed because of the load instruction. The nop in the third 3VDLX instruction is needed because of the true dependency between lines 5 and 6 in the original program. The nops in the last 3VDLX instruction would be needed if we had no further instructions (which is unlikely when the last instruction is not a CTI, but that's where the problem ends).
The instructions can be rearranged for efficient execution. The instructions at lines 5, 7 and 8 can be placed in the second 3VDLX instruction. The remaining can be placed in the third 3VDLX instruction yielding:
add r1, r2, r3; sub r6, r7, r8; lw r10, 0(r20) sub r16, r17, r18; add r19, r21, r22; sub r14, r1, r9 add r11, r10, r12; add r1, r14, r15; sw 0(r20), r6Since the VLIW instructions execute as a unit, there is no need to distinguish between the separate parts of a pipeline stage, as is done for superscalar.
add sub lw IF ID EX MEM WB sub add sub IF ID EX MEM WB add add sw IF ID EX MEM WB
The first three instructions can start executing without delay. Of the next three, only the one on line 5 can start, the add on line 4 must wait for the load and the add on line 6 must wait for the sub on line 5. With two instructions in ID the last group of three instructions are stalled at time 3.
Time 0 1 2 3 4 5 6 7 add IF1 ID1 EX1 M1 WB1 sub IF2 ID2 EX2 M2 WB2 lw IF3 ID3 EX3 M3 WB3 add IF1 ID1 EX1 M1 WB1 sub IF2 ID2 EX2 M2 WB2 add IF3 ID3 EX3 M3 WB3 sub IF1 ID1 EX1 M1 WB1 add IF2 ID2 EX2 M2 WB2 sw IF3 ID3 EX3 M3 WB3
Int 1 2 L/S 3 4 Int 5 6 L/S 7 8 Int 9 10 L/S 11 12where Int refers to an integer execution unit and L/S refers to a load/store unit. Note that reservation stations are dedicated to functional units. If ready instructions were waiting in reservation stations 1 and 2, only one could start in a cycle, even if the other two integer execution units were free.
The pipeline notation is the same used in class, for example, 5:EX3 indicates that execution unit 3 is executing an instruction from reservation station 5.
Load and store instructions need an ALU to compute addresses, in this solution a load/store unit has its own ALU, that stage is indicated by AD.
Execution is given below:
Time 0 1 2 3 4 5 6 7 add IF1 ID1 1:EX1 1:WB1 sub IF2 ID2 5:EX2 5:WB2 lw IF3 ID3 11:AD3 11:M3 11:WB3 add IF1 ID1 2:RS 2:RS 2:EX1 2:WB1 sub IF2 ID2 6:RS 6:EX2 6:WB2 add IF3 ID3 9:RS 9:RS 9:RS 9:EX3 9:WB3 sub IF1 ID1 1:EX1 1:WB1 add IF2 ID2 5:EX2 5:WB2 sw IF3 ID3 12:AD3 12:M3 12:WB3
David M. Koppelman - koppel@ee.lsu.edu | Modified 2 May 1997 18:34 (23:34 UTC) |