Instruction fetches for instructions within loop. (Since they are fetched multiple times.)
The store of the incremented histogram element, since the same address was used to load the element.
SW 0(r10), rll ! hist[e] = r11
Accesses to the array
elements. (Since they are
done sequentially.) For example:
LH r10, 0(r10) ! r10 = array[ i*JSIZE + j ]
Accesses to the hist
array, since the pattern of access
depends on the value of elements of array
, which
are not known in advance.
LW r11, 0(r10) ! r11 = hist[ e ];
The static count is simply the number of instructions, 24. The
dynamic count is 3275 instructions, the table below shows the dynamic
count of each instruction. The labels (e.g., NEXTI:
) are
included for clarity.
Dynamic | Instruction |
1 | ADDI r10, r0, #20 |
1 | MOVI2FP f0, r10 |
1 | ADDI r1, r0, #0 |
NEXTI: | |
11 | SGEI r10, r1, #10 |
11 | BNEZ r10, DONEI |
10 | ADDI r2, r0, #0 |
NEXTJ: | |
210 | SGEI r10, r2, #20 |
210 | BNEZ r10, DONEJ |
200 | MOVI2PF f1, r1 |
200 | MULT f1, f1, f0 |
200 | MOVFP2I r10, f1 |
200 | ADD r10, r10, r2 |
200 | SLLI r10, r10, #1 |
200 | ADD r10, r10, r4 |
200 | LH r10, 0(r10) |
200 | SLLI r10, r10, #2 |
200 | ADD r10, r10, r3 |
200 | LW r11, 0(r10) |
200 | ADDI r11, rll, #1 |
200 | SW 0(r10), rll |
200 | ADDI r2, r2, #1 |
200 | J NEXTJ |
DONEJ: | |
10 | ADDI r1, r1, #1 |
10 | J NEXTI |
DONEI: |
The scaled addressing mode makes accessing both arrays much easier.
Element array[i*JSIZE+j]
is still loaded into
a register, but the histogram element is incremented in a
memory-memory add instruction.
ADDI r1, r0, #0 ! i=0 NEXTI: SGEI r10, r1, #10 ! if i >= ISIZE ... BNEZ r10, DONEI ! ... exit loop. ADDI r2, r0, #0 ! j=0 NEXTJ: SGEI r10, r2, #20 ! if j >= JSIZE ... BNEZ r10, DONEJ ! ... exit loop MULTI r10, r1, JSIZE ! r10 = i * JSIZE ADD r10, r10, r2 ! r10 = i * JSIZE + j LH r10, 0(r4)[r10] ! d = 2; r10 = array[r10] ADDI O(r3)[r10],0(r3)[r10],#1 ! d=4; hist[r10] = hist[r10]+1; ADDI r2, r2, #1 ! r2 = j+1 J NEXTJ DONEJ: ADDI r1, r1, #1 ! r1 = i+1 J NEXTI DONEI:
The new static count is 14, the dynamic count is 1674, the table below lists count by instruction.
1 | ADDI r1, r0, #0 |
NEXTI: | |
11 | SGEI r10, r1, #10 |
11 | BNEZ r10, DONEI |
11 | ADDI r2, r0, #0 |
NEXTJ: | |
210 | SGEI r10, r2, #20 |
210 | BNEZ r10, DONEJ |
200 | MULTI r10, r1, JSIZE |
200 | ADD r10, r10, r2 |
200 | LH r10, 0(r4)[r10] |
200 | ADDI O(r3)[r10],0(r3)[r10],#1 |
200 | ADDI r2, r2, #1 |
200 | J NEXTJ |
DONEJ: | |
10 | ADDI r1, r1, #1 |
10 | J NEXTI |
DONEI: |
Need to solve: ICnew CPInew / fnew = ICold CPIold / fold for fnew. Solving yields, fnew = 1 GHz ICnew / ICold = 511 MHz
So, even though an implementation using complex addressing modes might have a slower clock, it can still run faster. Note that the assumption that the CPI is the same is unrealistic, since it would take much longer to execute, for example, an instruction that had three memory operands.
Need to solve: ICnew CPInew / fnew = ICold CPIold / fold for CPInew. The resulting instruction execution time is CPInew = CPIold ICold / ICnew = 0.489 CPI.
So, even though CPI will suffer in an ISA using elaborate addressing, it is still possible to gain performance.
Some notes on Sparc assembler:
%i0,%i1,...,%i7, %l0,%l1,...,%l7, %o0,%o1,...,%o7,
%g0,%g1,...,%g7
. The letters are for input, local,
output, and general. Register %i6
is also
called %fp
, frame pointer, and points to the current stack frame.
Register %g0
, like r0
in DLX, is always zero.
add %l0,%l1,%l2
means %l2 = %l1 + %l0
.
array
and histo
. (Variables i
,
j
, and e
are loaded differently since they
are local to the procedure and so are stored on the stack.)
mov %r1,%r2
is a synthetic instruction,
the assembler emits code for a or %g0,%r1,%r2
The output of the Sun compiler is shown below. The compiler helpfully shows the source lines corresponding to the assembler instructions, and also breaks code into basic blocks which is convenient for finding the dynamic instruction count. (A basic block is a group of instructions that can only be entered from the first instructions [can't branch into the middle] and can only be exited from the last instruction [only the last instruction can be a CTI]. All instructions in a basic block are executed the same number of times.)
Static count: 55 (yes, nops count, they take space). Dynamic count, 6938 (yes, nops count, they at least take fetch and decode space). The count is detailed below, by basic block, followed by an annotated assembly language listing.
Block Number |
Static Count | Dynamic Count (Per instr) |
Dynamic Count (Total) |
1 | 6 | 1 | 6 |
2 | 6 | 10 | 60 |
3 | 34 | 200 | 6800 |
4 | 7 | 10 | 70 |
5 | 2 | 1 | 2 |
! File hw1.c: ! 2 extern short int *array; ! 3 extern int *hist; ! 4 ! 5 #define ISIZE 10 ! 6 #define JSIZE 20 ! 7 ! 8 void histo() ! 9 { ! 10 int i,j; ! 11 ! 12 for(i=0; i<ISIZE; i++) mov 0,%l0 st %l0,[%fp-4] ! Store reg. l0 into frame pointer - 4 = &i. ld [%fp-4],%l0 ! Load the 0 just written. cmp %l0,10 ! Compare i to 10, set condition bits. bge .L19 ! Branch if i greater than 10. nop ! Sparc has a branch delay slot, wasted here. ! block 2 .L20: .L17: ! 13 for(j=0; j<JSIZE; j++) mov 0,%l0 st %l0,[%fp-8] ! Store reg l0 info frame pointer - 8 = &j... ld [%fp-8],%l0 ! ...and load it back in. cmp %l0,20 ! Compare to 20 bge .L23 ! Branch if not smaller. nop ! block 3 .L24: .L21: ! 14 { ! 15 int e= array[ i * JSIZE + j ]; sethi %hi(array),%l0 ! Load hi bits of array pointer addr. or %l0,%lo(array),%l0 ! Load rest of array pointer address. ld [%l0+0],%l3 ! Load address of first element. ld [%fp-4],%l0 ! Load i into l0 sll %l0,2,%l2 ! l2 = i * 4 sll %l0,4,%l1 ! l1 = i * 16 add %l2,%l1,%l0 ! l0 = i * 20 = i * JSIZE; ld [%fp-8],%l1 ! l1 = j add %l0,%l1,%l0 ! l0 = j + i * JSIZE sll %l0,1,%l1 ! l1 = l0 * 2 = l0 * sizeof(short int) add %l3,%l1,%l0 ! l0 = &array[ j + i * JSIZE ] ldsh [%l0+0],%l0 ! l0 = array[ j + i * JSIZE ] = e sll %l0,16,%l0 ! l0 = l0 << 16 (Part 1 of sign ext.) sra %l0,16,%l0 ! l0 = l0 >> 16 (Part 2 of sign ext.) st %l0,[%fp-12] ! e = l0 ! 16 hist[ e ]++; sethi %hi(hist),%l0 ! Load hi bits of addr of hist pointer or %l0,%lo(hist),%l0 ! Load lo bits of addr of hist pointer ld [%l0+0],%l2 ! Load hist pointer, &hist[0] ld [%fp-12],%l0 ! Load e sll %l0,2,%l1 ! l1 = e * sizeof(int) add %l2,%l1,%l0 ! l0 = &hist[e] st %l0,[%fp-16] ! Write &hist[e] into stack ld [%fp-16],%l0 ! Read &hist[e] from stack. ld [%l0+0],%l0 ! l0 = hist[e] add %l0,1,%l1 ! l1 = hist[e] + 1 ld [%fp-16],%l0 ! Read &hist[e] from stack. st %l1,[%l0+0] ! Store hist[e]+1 ! 17 } ld [%fp-8],%l0 ! l0 = j add %l0,1,%l0 ! l0 = l0 + 1 st %l0,[%fp-8] ! Store new value of j. ld [%fp-8],%l0 ! And load it again. cmp %l0,20 ! Compare to twenty bl .L21 ! If less, continue. nop ! block 4 .L25: .L23: ld [%fp-4],%l0 ! l0 = i. add %l0,1,%l0 ! l0 = l0 + 1 st %l0,[%fp-4] ! i = l0, store new value of i. ld [%fp-4],%l0 ! Load again. cmp %l0,10 ! Compare to ten.. bl .L17 ! ...if less continue. nop ! block 5 .L26: .L19: .L15: jmp %i7+8 ! Return restore
Notes on optimized code
array
and
hist
declared static
, no other procedure
would "know" there names, and the compiler would not have to
load there values multiple times.
Static count, 26, dynamic count 2476. Alot fewer!
Block Number | Static Count |
Dynamic Count (Per Instr) |
Dynamic Count (Total) |
1 | 5 | 1 | 5 |
2 | 2 | 10 | 20 |
3 | 12 | 200 | 2400 |
4 | 5 | 10 | 50 |
5 | 2 | 1 | 1 |
! FILE hw1.c ! 2 !extern short int *array; ! 3 !extern int *hist; ! 5 !#define ISIZE 10 ! 6 !#define JSIZE 20 ! 8 !void histo() ! 9 !{ ! 10 ! int i,j; ! 12 ! for(i=0; i<ISIZE; i++) ! Block 1 /* 000000 12 */ or %g0,0,%g4 /* 0x0004 */ or %g0,0,%o5 ! 13 ! for(j=0; j<JSIZE; j++) /* 0x0008 13 */ or %g0,0,%g1 ! j = 0; /* 0x000c */ sethi %hi(hist),%g3 /* 0x0010 */ sethi %hi(array),%g2 ! Block 2 .L900000109: ! 14 ! { ! 15 ! int e= array[ i * JSIZE + j ]; ! 16 ! hist[ e ]++; /* 0x0014 16 */ sll %o5,1,%o4 .L77000005: ! o3 = &array[0] /* 0x0018 */ ld [%g2+%lo(array)],%o3 ! volatile ! Block 3 .L900000110: ! o3 = array[ i * JSIZE + j ] = e /* 0x001c */ ldsh [%o3+%o4],%o3 ! volatile ! 17 ! } /* 0x0020 17 */ add %g1,1,%g1 ! j = j + 1 /* 0x0024 */ add %o4,2,%o4 ! ! Below, o2 = &hist /* 0x0028 16 */ ld [%g3+%lo(hist)],%o2 ! volatile /* 0x002c 17 */ cmp %g1,20 ! Compare j to 20. /* 0x0030 16 */ sll %o3,2,%o3 ! e * sizeof(int) /* 0x0034 */ add %o2,%o3,%o1 ! o1 = &hist[e] ! Below, o1 = hist[e]; /* 0x0038 */ ld [%o1],%o1 ! volatile /* 0x003c */ add %o1,1,%o1 ! o1 = hist[e] + 1 ! Below, save: hist[e] = hist[e] + 1 /* 0x0040 */ st %o1,[%o3+%o2] ! volatile ! Referring to cmp above, branch if j < 20. /* 0x0044 17 */ bl,a .L900000110 ! Branch less than. ! Instruction below is skipped if branch ! not taken. o3 = &array[0]; /* 0x0048 */ ld [%g2+%lo(array)],%o3 ! volatile ! Block 4 .L77000007: /* 0x004c */ add %g4,1,%g4 ! i = i + 1 ! Below, instead of multiplying i * JSIZE, ! just add JSIZE (20) each iteration. /* 0x0050 */ add %o5,20,%o5 /* 0x0054 */ cmp %g4,10 ! Compare i to ISIZE /* 0x0058 */ bl .L900000109 ! If less, continue ! Instruction below executed even if ! branch above is taken. /* 0x005c */ or %g0,0,%g1 ! j = 0 ! Block 5 .L77000010: /* 0x0060 */ retl /* 0x0064 */ nop /* 0x0068 0 */ .type histo,2 /* 0x0068 */ .size histo,(.-histo)
David M. Koppelman - koppel@ee.lsu.edu | Modified 20 Feb 1998 12:09 (18:09 UTC) |