Instruction-Level Parallelism
Since the ISA is defined in terms of one-at-a-time execution ...
... programs have ILP by "accident".
In other types of parallelism ...
... programmer (or compiler) specifies multiple paths of execution ...
... each with its own program counter (or thread).
ILP is between instructions in a single thread (using a single PC).
ILP is a property of programs ...
... which hardware may not be able to fully exploit.
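For example, in the hypothetical C fragment below the three statements come from a single thread (one PC) yet are independent of one another, so hardware could overlap or reorder them:

/* Hypothetical fragment: assuming p, q, and r point to distinct arrays,
   the three statements are mutually independent and could execute in
   parallel even though the program specifies them one at a time.        */
void ilp_example(int *p, int *q, int *r)
{
    p[0] = p[1] + p[2];   /* independent of the two statements below */
    q[0] = q[1] * q[2];   /* independent of the other two            */
    r[0] = r[1] - 1;      /* independent of the other two            */
}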
Dependency
Let A and B be two instructions ...
... with B following A ...
... by any distance.
If instruction B depends on instruction A ...
... A must execute before B ...
... unless special measures are taken.
Special measures are not always possible.
Three (or Four) Dependency Types
Name Dependencies
Control Dependency
Data or True Dependencies
B is data (truly) dependent on A if B uses a result produced by A.
Not practical for HW to remove this dependency.
Akin to a RAW hazard.
Name Dependencies (Two Kinds)
Antidependence: B writes a register (or location) that A reads.
Obviously, A must read before B writes.
Akin to a WAR hazard.
Output dependence: A and B write the same register (or location).
A must write before B.
Akin to a WAW hazard.
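A small C sketch (statements invented for illustration) of the three data-related dependency kinds above:

/* Invented statements A and B for each kind; values are arbitrary. */
void dependency_examples(void)
{
    int x, y = 1, z = 2, w, u;

    x = y + z;   /* A                                                               */
    w = x + 1;   /* B: true (data) dependence -- B reads x, which A writes (RAW)    */

    u = x + 1;   /* A: reads x                                                      */
    x = 5;       /* B: antidependence -- B writes x after A reads it (WAR)          */

    x = y + 3;   /* A: writes x                                                     */
    x = z + 4;   /* B: output dependence -- both write x; A must write first (WAW)  */

    (void) w; (void) u;   /* silence unused-variable warnings */
}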
Control Dependencies
B is control dependent on A if whether B executes depends on the outcome of A (e.g., A is a branch).
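For example, in the hypothetical fragment below the divide executes only if the test succeeds, so it is control dependent on the branch that implements the if:

/* Invented example: B runs only when A's condition holds. */
double safe_divide(double num, double den)
{
    if (den != 0.0)          /* A: the branch                 */
        return num / den;    /* B: control dependent on A     */
    return 0.0;              /* also control dependent on A   */
}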
Hazards v. Dependencies
The two are closely related ...
... but a hazard is a property of pipelines ...
... while a dependency is a property of instruction streams.
Pipeline will stall if ...
... there is a dependency between instructions ...
... and a corresponding hazard in pipeline.
DLX Pipeline For Chapter 4
Uses branch hardware described in section 3.5 ...
... and so branch delay is only one cycle.
All FP units are fully pipelined and each uses 4 stages.
An FP memory stage and an FP write-back stage are added ...
... called MF and WBF
Bypass paths (not shown below) are included.
See loop example for details.
Loop Example
for( i=40320; i>0; i--) a[i]+=c;
In DLX assembler
LOOP: LD    F0, 0(R1)
      ADDD  F2, F0, F4
      SD    0(R1), F2
      SUBI  R1, R1, #8
      BNEZ  R1, LOOP
Note that no real compiler would start the array at address 8.
Instead, the loop termination test would use something like SUB R2, R1, R3 ...
... where R3 contains the address just before the first element.
The branch would then test R2 instead of R1.
Execution On Pipeline.
Time      0   1   2   3   4   5   6   7   8   9   10  11  12
LD        IF  ID  EX  MEM WB
ADDD          IF  ID  --> A1  A2  A3  A4  MF  WBF
SD                IF  --> ID  EX  --> --> MEM WB
SUBI                      IF  ID  --> --> EX  MEM WB
BNEZ                          IF  --> --> ID  ID  EX  MEM WB
LD (2)                                            IF  ID  EX
ADDD (2)                                              IF  ID
SD (2)                                                    IF
(--> marks a stall cycle.)
The data provided by LD for ADDD will not be available until after LD finishes MEM, so at cycle 3 ADDD stalls.
SD needs its data, produced by ADDD, at the beginning of MEM, so MEM is delayed until cycle 8.
BNEZ first tests the branch condition when in ID at cycle 8. The condition is being computed by SUBI in cycle 8 and so won't be available to BNEZ until cycle 9.
Because there are separate memory and writeback stages for integer and floating-point operations, ADDD and SD can execute as shown in cycles 8 and 9. If instead the pipeline were designed so that the integer and FP operations used the same memory and WB stages, the SD instruction would be stalled.
The second iteration is shown in the bottom three rows of the table.
The loop loops with the fetch of the branch target, LD, in cycle 10 ...
... and so the loop takes 10 cycles per iteration.
Dependencies in Loop
The loop body contains true dependencies (LD to ADDD, ADDD to SD, and SUBI to BNEZ) ...
... in addition to control and anti-dependencies.
True dependencies limit the ways instructions can be rearranged ...
... and so limit performance improvement attainable ...
... by rearranging instructions to avoid stalls.
As will be shown in section 4.2 ...
... hardware can correctly execute instructions out of order ...
... even when output- and anti-dependencies are violated
However, order determined by true dependencies must be maintained.
Therefore, hardware can't do much with the loop above.
Simple (and limited) Change to Loop
LOOP: LD    F0, 0(R1)
      ADDD  F2, F0, F4
      SUBI  R1, R1, #8
      SD    8(R1), F2     ! Displacement compensates for movement of SUBI.
      BNEZ  R1, LOOP
Execution of the loop:
Time      0   1   2   3   4   5   6   7   8   9   10  11  12
LD        IF  ID  EX  MEM WB
ADDD          IF  ID  --> A1  A2  A3  A4  MF  WBF
SUBI              IF  --> ID  EX  MEM WB
SD                        IF  ID  EX  --> MEM WB
BNEZ                          IF  ID  --> EX  MEM WB
LD (2)                                IF  ID  EX  MEM WB
ADDD (2)                                  IF  ID  --> A1  A2
SUBI (2)                                      IF  --> ID  EX
SD (2)                                                IF  ID
BNEZ (2)                                                  IF
(--> marks a stall cycle.)
Here SD stalls for one rather than two cycles.
The condition is ready for BNEZ when it enters ID, so it doesn't stall there.
The branch target is computed at the end of ID so that fetching the first instruction of the next iteration can start ...
... in cycle 7 using aggressive hardware ...
... or cycle 8 using conservative hardware. (Hint: Because of the stall.)
The second iteration is shown in the bottom five rows of the table.
The second execution of ADDD starts in cycle 8 (shown on a separate line to save space) even before the first execution finishes.
This change improved performance slightly, but any further improvements are limited by true dependencies.
To give the hardware maximum flexibility, compilers and programmers need to avoid unnecessary true dependencies.
The index in the loop (i or R1) is the culprit here.
Some true dependencies can be removed using a technique called loop unrolling.
Loop Unrolling
for (i = 40320; i > 0; i -= 3) {
    a[i]   += c;  /* line 2 */
    a[i-1] += c;  /* line 3 */
    a[i-2] += c;  /* line 4 */
}
There are no true dependencies between any pair of statements in the body, so they can be executed in any order.
As a bonus, index arithmetic and testing are done one-third as often.
Loop Unrolling In Real Life
... number of iterations must be divisible by three ...
... or the unrolled loop must be followed by 1 or 2 copies of the body:
for (i = 40343; i > 2; i -= 3) {  /* Loop index not divisible by 3. */
    a[i]   += c;  /* line 2 */
    a[i-1] += c;  /* line 3 */
    a[i-2] += c;  /* line 4 */
}
a[2] += c;
a[1] += c;
Fortunately, this tedium is handled by compilers.
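For reference, a sketch of the pattern a compiler might generate for an arbitrary element count; the function name and parameter n are invented, and elements a[1]..a[n] are assumed valid as in the examples above:

/* Main loop handles three elements per iteration; a remainder loop
   handles the 0, 1, or 2 elements left over.                        */
void add_c_unrolled(double *a, int n, double c)
{
    int i;
    for (i = n; i > 2; i -= 3) {   /* unrolled by three */
        a[i]   += c;
        a[i-1] += c;
        a[i-2] += c;
    }
    for ( ; i > 0; i--)            /* leftover elements */
        a[i] += c;
}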
Execution of Unrolled Loops
LOOP: LD    F0, 0(R1)
      ADDD  F2, F0, F4
      SD    0(R1), F2
      LD    F0, -8(R1)
      ADDD  F2, F0, F4
      SD    -8(R1), F2
      LD    F0, -16(R1)
      ADDD  F2, F0, F4
      SD    -16(R1), F2
      SUBI  R1, R1, #24
      BNEZ  R1, LOOP
If the hardware could handle anti-dependencies (e.g., on F0) ...
... then it could execute the three loads first ...
... then the three adds, the subtraction, three stores, and finally the branch.
All of this would be done without adjusting instructions.
Scheduling Instructions in Unrolled Loop By Hand
Suppose the hardware could not remove name dependencies ...
... and could not start instructions out of order.
So there would be many stalls executing the unrolled loop above.
For such pipelines the code can be scheduled (rearranged) to avoid ...
... every stall but one.
LOOP: LD    F0, 0(R1)
      LD    F10, -8(R1)
      LD    F20, -16(R1)
      ADDD  F2, F0, F4
      ADDD  F12, F10, F4
      ADDD  F22, F20, F4
      SUBI  R1, R1, #24
      SD    24(R1), F2
      SD    16(R1), F12
      SD    8(R1), F22
      BNEZ  R1, LOOP
Note that registers were renamed to remove name dependencies.
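A rough C analogue of the renaming (temporaries invented for illustration): giving each element its own temporary plays the same role as using F2/F12/F22 instead of reusing F2.

/* Reusing a single temporary would reintroduce the name dependencies
   that the renamed assembly code above avoids.                        */
void unrolled_body_renamed(double *a, int i, double c)
{
    double t0 = a[i]   + c;
    double t1 = a[i-1] + c;
    double t2 = a[i-2] + c;
    a[i]   = t0;
    a[i-1] = t1;
    a[i-2] = t2;
}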
Dynamic Scheduling
Hardware that rearranges instruction execution to reduce stalls ...
... is said to use dynamic scheduling.
Two dynamic scheduling techniques are described in the text ...
... scoreboarding ...
... and Tomasulo's Approach.
Scoreboarding Overview
An instruction is issued when ...
... a functional unit (e.g., DIV) is available ...
... and there is no executing instruction that will write the same register.
(E.g., ADDD F0,F2,F4 would have to wait if MULF F0,F6,F8 were executing, since both write F0.)
Instructions wait in functional units ...
... until all their operands are available ...
... then they read them and start executing.
Instructions wait at functional unit outputs ...
... long enough to avoid any WAR hazards ...
... before proceeding to write their results.
In the scoreboarding implementation described in the book ...
... functional units are not pipelined ...
... instructions proceed from FUs directly to writeback.
Information to manage execution is stored in a central structure ...
... called a scoreboard.
The system using scoreboarding can execute instr. out of order ...
... but will stall when faced with name dependencies.
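A minimal C sketch of just the issue test described above, with one functional unit and invented field names; it is not the book's full scoreboard:

#include <stdbool.h>

#define NUM_REGS 32

typedef struct {
    bool fu_busy;                   /* the needed functional unit is executing */
    bool pending_write[NUM_REGS];   /* an executing instr. will write this reg */
} Scoreboard;

/* Returns true if an instruction needing the unit and writing register dest
   may issue this cycle; otherwise it waits, as in the ADDD/MULF example.    */
bool can_issue(const Scoreboard *sb, int dest)
{
    if (sb->fu_busy)                /* functional unit not available        */
        return false;
    if (sb->pending_write[dest])    /* another executing instr. writes dest */
        return false;
    return true;
}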
Tomasulo's Approach Overview
Hardware using Tomasulo's approach can rearrange instructions ...
... despite name dependencies.
Each functional unit has one or more reservation stations.
A reservation station may be ...
... empty ...
... holding a waiting instruction ...
... or holding an executing instruction.
Each reservation station is given a unique number ...
... used to identify future result of held instruction.
For example, consider:
DIVD  F0, F2, F4
ADDD  F6, F8, F0
SUBD  F0, F10, F12
MULD  F14, F0, F10
and suppose DIVD F0,F2,F4
were in reservation station 3 ...
... and SUBD F0,F10,F12
were in reservation station 4.
Then, in the hardware, ADDD would use 3 to identify operand F0 ...
... while MULD would use 4 to identify operand F0 ...
... thus avoiding the anti-dependency between ADDD and SUBD.
In ID, the instruction reads registers in the usual way ...
... a free reservation station is chosen based on instr. opcode ...
... instruction moves to reservation station at end of cycle.
Value actually read from register may be ...
... a register value (the normal case) ...
... or a reservation station number.
Functional unit outputs, using a common data bus, connect to ...
... register storage ...
... and reservation stations.
When a functional unit has a result ready ...
... it gains control of the bus ...
... and places the value and the instruction reservation station number on the bus.
All reservation stations watch this bus ...
... so they can replace reservation station numbers in their operand fields with the appropriate values (if necessary).
Moving back to the ID stage ...
... when a reservation station is chosen for an instruction ...
... the reservation station number is written into the register file ...
... replacing the value held in the destination register.
Writes to the register file are done when ...
... a result is placed on the common data bus ...
... and the reservation station number on the bus matches one stored in the register file.
Note: When reading this the second time, be sure not to confuse a reservation station's own number with the numbers of other reservation stations that may be stored in its operand fields.
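A simplified C sketch of the structures described above, with invented names and sizes; it shows only how a common data bus broadcast replaces reservation station numbers (tags) with values in waiting stations and in the register file:

#include <stdbool.h>

#define NUM_RS   8    /* reservation stations, numbered 0..NUM_RS-1 */
#define NUM_REGS 32

typedef struct {
    bool   busy;       /* station holds a waiting or executing instruction  */
    bool   ready[2];   /* operand j is a value (true) or a tag (false)      */
    double value[2];   /* operand value, valid when ready[j] is true        */
    int    tag[2];     /* else: number of the station producing the operand */
} ReservationStation;

typedef struct {
    bool   pending;    /* a reservation station will write this register    */
    int    tag;        /* that station's number                             */
    double value;      /* current contents, used when pending is false      */
} Register;

ReservationStation rs[NUM_RS];
Register           regs[NUM_REGS];

/* When a functional unit finishes, it puts (tag, value) on the common data
   bus.  Every reservation station and the register file watch the bus.     */
void cdb_broadcast(int tag, double value)
{
    for (int i = 0; i < NUM_RS; i++)
        for (int j = 0; j < 2; j++)
            if (rs[i].busy && !rs[i].ready[j] && rs[i].tag[j] == tag) {
                rs[i].value[j] = value;     /* replace tag with the value    */
                rs[i].ready[j] = true;
            }

    for (int r = 0; r < NUM_REGS; r++)
        if (regs[r].pending && regs[r].tag == tag) {
            regs[r].value   = value;        /* write back only on a tag match */
            regs[r].pending = false;
        }
}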
Practicality of Tomasulo's Approach
It's practical (unavoidable may be a better word) for ISAs that have a small number of registers because a programmer cannot rename registers (as was done in the loop example above) without soon running out of registers.
It is also practical in conjunction with other aggressive techniques ...
... such as multiple issue and speculative execution.
David M. Koppelman - koppel@ee.lsu.edu | Modified 8 Apr 1997 9:44 (14:44 UTC) |