EE 4720 Computer Architecture - HW 1 Solution (Spring 1998)

Problem 1

Temporal Locality

Instruction fetches for instructions within loop. (Since they are fetched multiple times.)

The store of the incremented histogram element, since the same address was used to load the element.

SW   0(r10), rll   ! hist[e] = r11

Spatial Locality

Accesses to the array elements. (Since they are done sequentially.) For example:

LH   r10, 0(r10)   ! r10 = array[ i*JSIZE + j ]

Possibly Neither Temporal Nor Spatial Locality

Accesses to the hist array, since the pattern of access depends on the value of elements of array, which are not known in advance.

LW   r11, 0(r10)   ! r11 = hist[ e ];

Problem 2

The static count is simply the number of instructions, 24. The dynamic count is 3275 instructions, the table below shows the dynamic count of each instruction. The labels (e.g., NEXTI:) are included for clarity.

Dynamic	`Instruction`
1	`ADDI r10, r0, #20`
1	`MOVI2FP f0, r10`
1	`ADDI r1, r0, #0`
	`NEXTI:`
11	`SGEI r10, r1, #10`
11	`BNEZ r10, DONEI`
10	`ADDI r2, r0, #0`
	`NEXTJ:`
210	`SGEI r10, r2, #20`
210	`BNEZ r10, DONEJ`
200	`MOVI2PF f1, r1`
200	`MULT f1, f1, f0`
200	`MOVFP2I r10, f1`
200	`ADD r10, r10, r2`
200	`SLLI r10, r10, #1`
200	`ADD r10, r10, r4`
200	`LH r10, 0(r10)`
200	`SLLI r10, r10, #2`
200	`ADD r10, r10, r3`
200	`LW r11, 0(r10)`
200	`ADDI r11, rll, #1`
200	`SW 0(r10), rll`
200	`ADDI r2, r2, #1`
200	`J NEXTJ`
	`DONEJ:`
10	`ADDI r1, r1, #1`
10	`J NEXTI`
	`DONEI:`

Problem 3

The scaled addressing mode makes accessing both arrays much easier. Element array[i*JSIZE+j] is still loaded into a register, but the histogram element is incremented in a memory-memory add instruction.


        ADDI r1, r0, #0    ! i=0
NEXTI:
        SGEI r10, r1, #10  ! if i >= ISIZE ...
        BNEZ r10, DONEI    ! ... exit loop.
        ADDI r2, r0, #0    ! j=0
NEXTJ:
        SGEI r10, r2, #20  ! if j >= JSIZE ...
        BNEZ r10, DONEJ    ! ... exit loop
        MULTI r10, r1, JSIZE      !  r10 = i * JSIZE
        ADD   r10, r10, r2        !  r10 = i * JSIZE + j
        LH    r10, 0(r4)[r10]     !  d = 2; r10 = array[r10]
        ADDI O(r3)[r10],0(r3)[r10],#1  ! d=4; hist[r10] = hist[r10]+1;
        ADDI r2, r2, #1    ! r2 = j+1
        J NEXTJ
DONEJ:
        ADDI r1, r1, #1    ! r1 = i+1
        J NEXTI
DONEI:

The new static count is 14, the dynamic count is 1674, the table below lists count by instruction.

1	`ADDI r1, r0, #0`
	`NEXTI:`
11	`SGEI r10, r1, #10`
11	`BNEZ r10, DONEI`
11	`ADDI r2, r0, #0`
	`NEXTJ:`
210	`SGEI r10, r2, #20`
210	`BNEZ r10, DONEJ`
200	`MULTI r10, r1, JSIZE`
200	`ADD r10, r10, r2`
200	`LH r10, 0(r4)[r10]`
200	`ADDI O(r3)[r10],0(r3)[r10],#1`
200	`ADDI r2, r2, #1`
200	`J NEXTJ`
	`DONEJ:`
10	`ADDI r1, r1, #1`
10	`J NEXTI`
	`DONEI:`

Problem 4

Clock frequency for fixed CPI

Need to solve: IC_new CPI_new / f_new = IC_old CPI_old / f_old for f_new. Solving yields, f_new = 1 GHz IC_new / IC_old = 511 MHz

So, even though an implementation using complex addressing modes might have a slower clock, it can still run faster. Note that the assumption that the CPI is the same is unrealistic, since it would take much longer to execute, for example, an instruction that had three memory operands.

CPI for fixed clock frequency.

Need to solve: IC_new CPI_new / f_new = IC_old CPI_old / f_old for CPI_new. The resulting instruction execution time is CPI_new = CPI_old IC_old / IC_new = 0.489 CPI.

So, even though CPI will suffer in an ISA using elaborate addressing, it is still possible to gain performance.

Problem 5

Unoptimized Assembler

Some notes on Sparc assembler:

Register names are: %i0,%i1,...,%i7, %l0,%l1,...,%l7, %o0,%o1,...,%o7, %g0,%g1,...,%g7. The letters are for input, local, output, and general. Register %i6 is also called %fp, frame pointer, and points to the current stack frame. Register %g0, like r0 in DLX, is always zero.
Operands are backwards (compared to DLX), so add %l0,%l1,%l2 means %l2 = %l1 + %l0.
Sparc branches are delayed, in the unoptimized code below the delay slots are filled with nops.
In the code below global variable values have to be loaded into registers before use (unlike the DLX code in the assignment). This is a two-step process: first the address of the variable must be loaded, then the memory at that address is loaded. This is done for array and histo. (Variables i, j, and e are loaded differently since they are local to the procedure and so are stored on the stack.)
In unoptimized code, any change in a variable is immediately written back to memory, and memory is read whenever a variable is used, even if it was written in the preceding instruction. This makes the program easier to debug.
Instruction mov %r1,%r2 is a synthetic instruction, the assembler emits code for a or %g0,%r1,%r2

The output of the Sun compiler is shown below. The compiler helpfully shows the source lines corresponding to the assembler instructions, and also breaks code into basic blocks which is convenient for finding the dynamic instruction count. (A basic block is a group of instructions that can only be entered from the first instructions [can't branch into the middle] and can only be exited from the last instruction [only the last instruction can be a CTI]. All instructions in a basic block are executed the same number of times.)

Static count: 55 (yes, nops count, they take space). Dynamic count, 6938 (yes, nops count, they at least take fetch and decode space). The count is detailed below, by basic block, followed by an annotated assembly language listing.

Block Number	Static Count	Dynamic Count (Per instr)	Dynamic Count (Total)
1	6	1	6
2	6	10	60
3	34	200	6800
4	7	10	70
5	2	1	2

! File hw1.c:
!    2  extern short int *array;
!    3  extern int *hist;
!    4  
!    5  #define ISIZE 10
!    6  #define JSIZE 20
!    7  
!    8  void histo()
!    9  {
!   10    int i,j;
!   11  
!   12    for(i=0; i<ISIZE; i++)

        mov     0,%l0
        st      %l0,[%fp-4]   ! Store reg. l0 into frame pointer - 4 =  &i.
        ld      [%fp-4],%l0   ! Load the 0 just written. 
        cmp     %l0,10        ! Compare i to 10, set condition bits.
        bge     .L19          ! Branch if i greater than 10.
        nop                   ! Sparc has a branch delay slot, wasted here.

        ! block 2
.L20:
.L17:

!   13      for(j=0; j<JSIZE; j++)

        mov     0,%l0
        st      %l0,[%fp-8]   ! Store reg l0 info frame pointer - 8 = &j...
        ld      [%fp-8],%l0   ! ...and load it back in.
        cmp     %l0,20        ! Compare to 20
        bge     .L23          ! Branch if not smaller.
        nop                   

        ! block 3
.L24:
.L21:

!   14        {
!   15          int e= array[ i * JSIZE + j ];

        sethi   %hi(array),%l0        ! Load hi bits of array pointer addr.
        or      %l0,%lo(array),%l0    ! Load rest of array pointer address.
        ld      [%l0+0],%l3           ! Load address of first element.
        ld      [%fp-4],%l0           ! Load i into l0
        sll     %l0,2,%l2             ! l2 = i * 4 
        sll     %l0,4,%l1             ! l1 = i * 16
        add     %l2,%l1,%l0           ! l0 = i * 20 = i * JSIZE;
        ld      [%fp-8],%l1           ! l1 = j
        add     %l0,%l1,%l0           ! l0 = j + i * JSIZE
        sll     %l0,1,%l1             ! l1 = l0 * 2 = l0 * sizeof(short int)
        add     %l3,%l1,%l0           ! l0 = &array[ j + i * JSIZE ]
        ldsh    [%l0+0],%l0           ! l0 = array[ j + i * JSIZE ] = e
        sll     %l0,16,%l0            ! l0 = l0 << 16  (Part 1 of sign ext.)
        sra     %l0,16,%l0            ! l0 = l0 >> 16  (Part 2 of sign ext.)
        st      %l0,[%fp-12]          ! e = l0

!   16          hist[ e ]++;

        sethi   %hi(hist),%l0         ! Load hi bits of addr of hist pointer
        or      %l0,%lo(hist),%l0     ! Load lo bits of addr of hist pointer
        ld      [%l0+0],%l2           ! Load hist pointer, &hist[0]
        ld      [%fp-12],%l0          ! Load e 
        sll     %l0,2,%l1             ! l1 = e * sizeof(int)
        add     %l2,%l1,%l0           ! l0 = &hist[e]
        st      %l0,[%fp-16]          ! Write &hist[e] into stack
        ld      [%fp-16],%l0          ! Read &hist[e] from stack.
        ld      [%l0+0],%l0           ! l0 = hist[e]
        add     %l0,1,%l1             ! l1 = hist[e] + 1
        ld      [%fp-16],%l0          ! Read &hist[e] from stack.
        st      %l1,[%l0+0]           ! Store hist[e]+1

!   17        }

        ld      [%fp-8],%l0           ! l0 = j
        add     %l0,1,%l0             ! l0 = l0 + 1
        st      %l0,[%fp-8]           ! Store new value of j.
        ld      [%fp-8],%l0           ! And load it again.
        cmp     %l0,20                ! Compare to twenty
        bl      .L21                  ! If less, continue.
        nop

        ! block 4
.L25:
.L23:
        ld      [%fp-4],%l0           ! l0 = i.
        add     %l0,1,%l0             ! l0 = l0 + 1
        st      %l0,[%fp-4]           ! i = l0, store new value of i.
        ld      [%fp-4],%l0           ! Load again. 
        cmp     %l0,10                ! Compare to ten..
        bl      .L17                  ! ...if less continue.
        nop

        ! block 5
.L26:
.L19:
.L15:
        jmp     %i7+8                 ! Return
        restore

Optimized (for speed) Assembler

Notes on optimized code

Better use of registers is made. There is less saving and loading variable values.
Instructions are rearranged, sometimes to make more efficient use of the pipeline. (Do arithmetic several instructions before the result is needed.)
The block labels below are instructor annotations. (The block labels in the unoptimized code were generated by the compiler.)
The "volatile" comment was placed by the compiler. That indicates a variable is being loaded even though it could have been left in a register because it's possible that the value changed in memory while the code was executing. Were array and hist declared static, no other procedure would "know" there names, and the compiler would not have to load there values multiple times.

Static count, 26, dynamic count 2476. Alot fewer!

Block Number	Static Count	Dynamic Count (Per Instr)	Dynamic Count (Total)
1	5	1	5
2	2	10	20
3	12	200	2400
4	5	10	50
5	2	1	1

! FILE hw1.c

!    2		      !extern short int *array;
!    3		      !extern int *hist;
!    5		      !#define ISIZE 10
!    6		      !#define JSIZE 20
!    8		      !void histo()
!    9		      !{
!   10		      !  int i,j;
!   12		      !  for(i=0; i<ISIZE; i++)

! Block 1

/* 000000	  12 */		or	%g0,0,%g4
/* 0x0004	     */		or	%g0,0,%o5

!   13		      !    for(j=0; j<JSIZE; j++)

/* 0x0008	  13 */		or	%g0,0,%g1       ! j = 0;
/* 0x000c	     */		sethi	%hi(hist),%g3
/* 0x0010	     */		sethi	%hi(array),%g2

! Block 2

                       .L900000109:

!   14		      !      {
!   15		      !	int e= array[ i * JSIZE + j ];
!   16		      !	hist[ e ]++;

/* 0x0014	  16 */		sll	%o5,1,%o4
                       .L77000005:
			        ! o3 = &array[0]
/* 0x0018	     */		ld	[%g2+%lo(array)],%o3 ! volatile

! Block 3

                       .L900000110:
			        ! o3 = array[ i * JSIZE + j ] = e
/* 0x001c	     */		ldsh	[%o3+%o4],%o3 ! volatile

!   17		      !      }

/* 0x0020	  17 */		add	%g1,1,%g1  ! j = j + 1
/* 0x0024	     */		add	%o4,2,%o4  ! 
				! Below, o2 = &hist
/* 0x0028	  16 */		ld	[%g3+%lo(hist)],%o2 ! volatile
/* 0x002c	  17 */		cmp	%g1,20     ! Compare j to 20.
/* 0x0030	  16 */		sll	%o3,2,%o3  ! e * sizeof(int)
/* 0x0034	     */		add	%o2,%o3,%o1 ! o1 = &hist[e]
				! Below, o1 = hist[e];
/* 0x0038	     */		ld	[%o1],%o1 ! volatile 
/* 0x003c	     */		add	%o1,1,%o1  ! o1 = hist[e] + 1
				! Below, save: hist[e] = hist[e] + 1
/* 0x0040	     */		st	%o1,[%o3+%o2] ! volatile
				! Referring to cmp above, branch if j < 20.
/* 0x0044	  17 */		bl,a	.L900000110  ! Branch less than.
				! Instruction below is skipped if branch
				! not taken.  o3 = &array[0];
/* 0x0048	     */		ld	[%g2+%lo(array)],%o3 ! volatile

! Block 4

                       .L77000007:
/* 0x004c	     */		add	%g4,1,%g4   ! i = i + 1
				! Below, instead of multiplying i * JSIZE,
				! just add JSIZE (20) each iteration.
/* 0x0050	     */		add	%o5,20,%o5  
/* 0x0054	     */		cmp	%g4,10     ! Compare i to ISIZE 
/* 0x0058	     */		bl	.L900000109  ! If less, continue
				! Instruction below executed even if
				! branch above is taken.
/* 0x005c	     */		or	%g0,0,%g1    ! j = 0

! Block 5

                       .L77000010:
/* 0x0060	     */		retl
/* 0x0064	     */		nop
/* 0x0068	   0 */		.type	histo,2
/* 0x0068	     */		.size	histo,(.-histo)

David M. Koppelman - koppel@ee.lsu.edu

Modified 20 Feb 1998 12:09 (18:09 UTC)