# EE 4720 Computer Architecture - HW 1 Solution (Spring 1998)

### Problem 1

#### Temporal Locality

Instruction fetches for instructions within loop. (Since they are fetched multiple times.)

The store of the incremented histogram element, since the same address was used to load the element.

```SW   0(r10), rll   ! hist[e] = r11
```

#### Spatial Locality

Accesses to the `array` elements. (Since they are done sequentially.) For example:

```LH   r10, 0(r10)   ! r10 = array[ i*JSIZE + j ]
```

#### Possibly Neither Temporal Nor Spatial Locality

Accesses to the `hist` array, since the pattern of access depends on the value of elements of `array`, which are not known in advance.

```LW   r11, 0(r10)   ! r11 = hist[ e ];
```

### Problem 2

The static count is simply the number of instructions, 24. The dynamic count is 3275 instructions, the table below shows the dynamic count of each instruction. The labels (e.g., `NEXTI:`) are included for clarity.

 Dynamic ` Instruction ` 1 `ADDI r10, r0, #20 ` 1 `MOVI2FP f0, r10 ` 1 `ADDI r1, r0, #0 ` `NEXTI: ` 11 `SGEI r10, r1, #10 ` 11 `BNEZ r10, DONEI ` 10 `ADDI r2, r0, #0 ` `NEXTJ: ` 210 `SGEI r10, r2, #20 ` 210 `BNEZ r10, DONEJ ` 200 `MOVI2PF f1, r1 ` 200 `MULT f1, f1, f0 ` 200 `MOVFP2I r10, f1 ` 200 `ADD r10, r10, r2 ` 200 `SLLI r10, r10, #1 ` 200 `ADD r10, r10, r4 ` 200 `LH r10, 0(r10) ` 200 `SLLI r10, r10, #2 ` 200 `ADD r10, r10, r3 ` 200 `LW r11, 0(r10) ` 200 `ADDI r11, rll, #1 ` 200 `SW 0(r10), rll ` 200 `ADDI r2, r2, #1 ` 200 `J NEXTJ ` `DONEJ: ` 10 `ADDI r1, r1, #1 ` 10 `J NEXTI ` `DONEI: `

### Problem 3

The scaled addressing mode makes accessing both arrays much easier. Element `array[i*JSIZE+j]` is still loaded into a register, but the histogram element is incremented in a memory-memory add instruction.

```
ADDI r1, r0, #0    ! i=0
NEXTI:
SGEI r10, r1, #10  ! if i >= ISIZE ...
BNEZ r10, DONEI    ! ... exit loop.
ADDI r2, r0, #0    ! j=0
NEXTJ:
SGEI r10, r2, #20  ! if j >= JSIZE ...
BNEZ r10, DONEJ    ! ... exit loop
MULTI r10, r1, JSIZE      !  r10 = i * JSIZE
ADD   r10, r10, r2        !  r10 = i * JSIZE + j
LH    r10, 0(r4)[r10]     !  d = 2; r10 = array[r10]
ADDI O(r3)[r10],0(r3)[r10],#1  ! d=4; hist[r10] = hist[r10]+1;
ADDI r2, r2, #1    ! r2 = j+1
J NEXTJ
DONEJ:
ADDI r1, r1, #1    ! r1 = i+1
J NEXTI
DONEI:
```

The new static count is 14, the dynamic count is 1674, the table below lists count by instruction.

 1 `ADDI r1, r0, #0 ` `NEXTI: ` 11 `SGEI r10, r1, #10 ` 11 `BNEZ r10, DONEI ` 11 `ADDI r2, r0, #0 ` `NEXTJ: ` 210 `SGEI r10, r2, #20 ` 210 `BNEZ r10, DONEJ ` 200 `MULTI r10, r1, JSIZE ` 200 `ADD r10, r10, r2 ` 200 `LH r10, 0(r4)[r10] ` 200 `ADDI O(r3)[r10],0(r3)[r10],#1 ` 200 `ADDI r2, r2, #1 ` 200 `J NEXTJ ` `DONEJ: ` 10 `ADDI r1, r1, #1 ` 10 `J NEXTI ` `DONEI: `

### Problem 4

#### Clock frequency for fixed CPI

Need to solve: ICnew CPInew / fnew = ICold CPIold / fold for fnew. Solving yields, fnew = 1 GHz ICnew / ICold = 511 MHz

So, even though an implementation using complex addressing modes might have a slower clock, it can still run faster. Note that the assumption that the CPI is the same is unrealistic, since it would take much longer to execute, for example, an instruction that had three memory operands.

#### CPI for fixed clock frequency.

Need to solve: ICnew CPInew / fnew = ICold CPIold / fold for CPInew. The resulting instruction execution time is CPInew = CPIold ICold / ICnew = 0.489 CPI.

So, even though CPI will suffer in an ISA using elaborate addressing, it is still possible to gain performance.

### Problem 5

#### Unoptimized Assembler

Some notes on Sparc assembler:

• Register names are: ```%i0,%i1,...,%i7, %l0,%l1,...,%l7, %o0,%o1,...,%o7, %g0,%g1,...,%g7```. The letters are for input, local, output, and general. Register `%i6` is also called `%fp`, frame pointer, and points to the current stack frame. Register `%g0`, like `r0` in DLX, is always zero.
• Operands are backwards (compared to DLX), so `add %l0,%l1,%l2` means `%l2 = %l1 + %l0`.
• Sparc branches are delayed, in the unoptimized code below the delay slots are filled with nops.
• In the code below global variable values have to be loaded into registers before use (unlike the DLX code in the assignment). This is a two-step process: first the address of the variable must be loaded, then the memory at that address is loaded. This is done for `array` and `histo`. (Variables `i`, `j`, and `e` are loaded differently since they are local to the procedure and so are stored on the stack.)
• In unoptimized code, any change in a variable is immediately written back to memory, and memory is read whenever a variable is used, even if it was written in the preceding instruction. This makes the program easier to debug.
• Instruction `mov %r1,%r2` is a synthetic instruction, the assembler emits code for a `or %g0,%r1,%r2`

The output of the Sun compiler is shown below. The compiler helpfully shows the source lines corresponding to the assembler instructions, and also breaks code into basic blocks which is convenient for finding the dynamic instruction count. (A basic block is a group of instructions that can only be entered from the first instructions [can't branch into the middle] and can only be exited from the last instruction [only the last instruction can be a CTI]. All instructions in a basic block are executed the same number of times.)

Static count: 55 (yes, nops count, they take space). Dynamic count, 6938 (yes, nops count, they at least take fetch and decode space). The count is detailed below, by basic block, followed by an annotated assembly language listing.

 BlockNumber StaticCount Dynamic Count (Per instr) Dynamic Count (Total) 1 6 1 6 2 6 10 60 3 34 200 6800 4 7 10 70 5 2 1 2
```! File hw1.c:
!    2  extern short int *array;
!    3  extern int *hist;
!    4
!    5  #define ISIZE 10
!    6  #define JSIZE 20
!    7
!    8  void histo()
!    9  {
!   10    int i,j;
!   11
!   12    for(i=0; i<ISIZE; i++)

mov     0,%l0
st      %l0,[%fp-4]   ! Store reg. l0 into frame pointer - 4 =  &i.
ld      [%fp-4],%l0   ! Load the 0 just written.
cmp     %l0,10        ! Compare i to 10, set condition bits.
bge     .L19          ! Branch if i greater than 10.
nop                   ! Sparc has a branch delay slot, wasted here.

! block 2
.L20:
.L17:

!   13      for(j=0; j<JSIZE; j++)

mov     0,%l0
st      %l0,[%fp-8]   ! Store reg l0 info frame pointer - 8 = &j...
ld      [%fp-8],%l0   ! ...and load it back in.
cmp     %l0,20        ! Compare to 20
bge     .L23          ! Branch if not smaller.
nop

! block 3
.L24:
.L21:

!   14        {
!   15          int e= array[ i * JSIZE + j ];

sethi   %hi(array),%l0        ! Load hi bits of array pointer addr.
or      %l0,%lo(array),%l0    ! Load rest of array pointer address.
ld      [%l0+0],%l3           ! Load address of first element.
ld      [%fp-4],%l0           ! Load i into l0
sll     %l0,2,%l2             ! l2 = i * 4
sll     %l0,4,%l1             ! l1 = i * 16
add     %l2,%l1,%l0           ! l0 = i * 20 = i * JSIZE;
ld      [%fp-8],%l1           ! l1 = j
add     %l0,%l1,%l0           ! l0 = j + i * JSIZE
sll     %l0,1,%l1             ! l1 = l0 * 2 = l0 * sizeof(short int)
add     %l3,%l1,%l0           ! l0 = &array[ j + i * JSIZE ]
ldsh    [%l0+0],%l0           ! l0 = array[ j + i * JSIZE ] = e
sll     %l0,16,%l0            ! l0 = l0 << 16  (Part 1 of sign ext.)
sra     %l0,16,%l0            ! l0 = l0 >> 16  (Part 2 of sign ext.)
st      %l0,[%fp-12]          ! e = l0

!   16          hist[ e ]++;

sethi   %hi(hist),%l0         ! Load hi bits of addr of hist pointer
or      %l0,%lo(hist),%l0     ! Load lo bits of addr of hist pointer
ld      [%l0+0],%l2           ! Load hist pointer, &hist[0]
ld      [%fp-12],%l0          ! Load e
sll     %l0,2,%l1             ! l1 = e * sizeof(int)
add     %l2,%l1,%l0           ! l0 = &hist[e]
st      %l0,[%fp-16]          ! Write &hist[e] into stack
ld      [%fp-16],%l0          ! Read &hist[e] from stack.
ld      [%l0+0],%l0           ! l0 = hist[e]
add     %l0,1,%l1             ! l1 = hist[e] + 1
ld      [%fp-16],%l0          ! Read &hist[e] from stack.
st      %l1,[%l0+0]           ! Store hist[e]+1

!   17        }

ld      [%fp-8],%l0           ! l0 = j
add     %l0,1,%l0             ! l0 = l0 + 1
st      %l0,[%fp-8]           ! Store new value of j.
ld      [%fp-8],%l0           ! And load it again.
cmp     %l0,20                ! Compare to twenty
bl      .L21                  ! If less, continue.
nop

! block 4
.L25:
.L23:
ld      [%fp-4],%l0           ! l0 = i.
add     %l0,1,%l0             ! l0 = l0 + 1
st      %l0,[%fp-4]           ! i = l0, store new value of i.
ld      [%fp-4],%l0           ! Load again.
cmp     %l0,10                ! Compare to ten..
bl      .L17                  ! ...if less continue.
nop

! block 5
.L26:
.L19:
.L15:
jmp     %i7+8                 ! Return
restore
```

#### Optimized (for speed) Assembler

Notes on optimized code

• Better use of registers is made. There is less saving and loading variable values.
• Instructions are rearranged, sometimes to make more efficient use of the pipeline. (Do arithmetic several instructions before the result is needed.)
• The block labels below are instructor annotations. (The block labels in the unoptimized code were generated by the compiler.)
• The "volatile" comment was placed by the compiler. That indicates a variable is being loaded even though it could have been left in a register because it's possible that the value changed in memory while the code was executing. Were `array` and `hist` declared `static`, no other procedure would "know" there names, and the compiler would not have to load there values multiple times.

Static count, 26, dynamic count 2476. Alot fewer!

 BlockNumber StaticCount Dynamic Count(Per Instr) Dynamic Count(Total) 1 5 1 5 2 2 10 20 3 12 200 2400 4 5 10 50 5 2 1 1
```! FILE hw1.c

!    2		      !extern short int *array;
!    3		      !extern int *hist;
!    5		      !#define ISIZE 10
!    6		      !#define JSIZE 20
!    8		      !void histo()
!    9		      !{
!   10		      !  int i,j;
!   12		      !  for(i=0; i<ISIZE; i++)

! Block 1

/* 000000	  12 */		or	%g0,0,%g4
/* 0x0004	     */		or	%g0,0,%o5

!   13		      !    for(j=0; j<JSIZE; j++)

/* 0x0008	  13 */		or	%g0,0,%g1       ! j = 0;
/* 0x000c	     */		sethi	%hi(hist),%g3
/* 0x0010	     */		sethi	%hi(array),%g2

! Block 2

.L900000109:

!   14		      !      {
!   15		      !	int e= array[ i * JSIZE + j ];
!   16		      !	hist[ e ]++;

/* 0x0014	  16 */		sll	%o5,1,%o4
.L77000005:
! o3 = &array[0]
/* 0x0018	     */		ld	[%g2+%lo(array)],%o3 ! volatile

! Block 3

.L900000110:
! o3 = array[ i * JSIZE + j ] = e
/* 0x001c	     */		ldsh	[%o3+%o4],%o3 ! volatile

!   17		      !      }

/* 0x0020	  17 */		add	%g1,1,%g1  ! j = j + 1
/* 0x0024	     */		add	%o4,2,%o4  !
! Below, o2 = &hist
/* 0x0028	  16 */		ld	[%g3+%lo(hist)],%o2 ! volatile
/* 0x002c	  17 */		cmp	%g1,20     ! Compare j to 20.
/* 0x0030	  16 */		sll	%o3,2,%o3  ! e * sizeof(int)
/* 0x0034	     */		add	%o2,%o3,%o1 ! o1 = &hist[e]
! Below, o1 = hist[e];
/* 0x0038	     */		ld	[%o1],%o1 ! volatile
/* 0x003c	     */		add	%o1,1,%o1  ! o1 = hist[e] + 1
! Below, save: hist[e] = hist[e] + 1
/* 0x0040	     */		st	%o1,[%o3+%o2] ! volatile
! Referring to cmp above, branch if j < 20.
/* 0x0044	  17 */		bl,a	.L900000110  ! Branch less than.
! Instruction below is skipped if branch
! not taken.  o3 = &array[0];
/* 0x0048	     */		ld	[%g2+%lo(array)],%o3 ! volatile

! Block 4

.L77000007:
/* 0x004c	     */		add	%g4,1,%g4   ! i = i + 1
! Below, instead of multiplying i * JSIZE,
! just add JSIZE (20) each iteration.
/* 0x0050	     */		add	%o5,20,%o5
/* 0x0054	     */		cmp	%g4,10     ! Compare i to ISIZE
/* 0x0058	     */		bl	.L900000109  ! If less, continue
! Instruction below executed even if
! branch above is taken.
/* 0x005c	     */		or	%g0,0,%g1    ! j = 0

! Block 5

.L77000010:
/* 0x0060	     */		retl
/* 0x0064	     */		nop
/* 0x0068	   0 */		.type	histo,2
/* 0x0068	     */		.size	histo,(.-histo)
```

 David M. Koppelman - koppel@ee.lsu.edu Modified 20 Feb 1998 12:09 (18:09 UTC)