# Simple Cost and Performance Model

## Simple Models Goals

Choose between design alternatives.

Characterize a particular design approach.

The models will be used throughout the semester.

## Simple Model Non-Goals

Approximate the post synthesis cost and performance.

## Practice Problems

Problems based on the material in these slides.

2016 Final Exam Problem 2b and Problem 4 (greedy and fcfs fit).

#### The Simple Cost Model

#### Simple Model Base Costs

2-input AND gate: 1 u<sub>c</sub>. (One unit of cost.)

2-input OR gate: 1 u<sub>c</sub>.

NOT gate:  $0 u_c$ . (Zero cost.)

#### Simple Model Derived Costs

Based on equivalent circuit using gates above.

*E.g.*, cost of *n*-input OR gate: (n-1) u<sub>c</sub>.

*E.g.*, cost of *n*-input AND gate: (n-1) u<sub>c</sub>.

*E.g.*, cost of a 2-input XOR gate is  $3 u_c$ .

E.g., cost of an edge-triggered flip-flop is  $7 u_c$ .

E.g., cost of a *w*-bit edge-triggered register is  $7w u_c$ .

Simple Examples:

- A 2-input AND gate: cost is  $1 u_c$ .
- A 10-input OR gate: cost is  $9 u_c$ .
- A 1-input OR gate: free.



#### The Simple Performance Model

The Simple Performance Model Base Delays

2-input AND gate:  $1 u_t$  (One time unit.)

2-input OR gate: 1 ut.

NOT gate:  $0 u_t$ .

The Simple Performance Model Derived Delays

Based on equivalent circuit using gates above.

*E.g.*, delay of a 2-input XOR gate:  $2 u_t$ .

*E.g.*, delay of *n*-input OR gate:  $\lceil \lg n \rceil u_t$ .

*E.g.*, delay of *n*-input AND gate:  $\lceil \lg n \rceil u_t$ .

E.g., delay of an edge-triggered flip-flop is  $6 u_t$ .

### Multiplexors



A 2-input, 1-bit mux:

Cost, 3 units; delay, 2 units.

### A 2-input, w-bit mux:

This is equivalent to w copies of the mux above.

Cost, 3w units; delay, 2 units.

### Multiplexors

An *n*-input, *w*-bit mux, unoptimized decoder implementation:

Note: The number of select bits is  $\lceil \lg n \rceil$ .

Cost Computation:

Decoder AND gates (first column):  $n(\lceil \lg n \rceil - 1)$ .

Selection AND gates (second column): wn.

```
OR gate: w(n-1).
```

Total cost:  $n(\lceil \lg n \rceil - 1) + 2wn - w$ .



### Delay Computation

Decoder AND gate (first column):  $\lceil \lg(\lceil \lg n \rceil - 1) \rceil$ .

Selection AND gates (second column): 1.

OR gate:  $\lceil \lg n \rceil$ .

Total delay:  $\lceil \lg(\lceil \lg n \rceil - 1) \rceil + 1 + \lceil \lg n \rceil$ .

Approximate delay:  $1 + \lceil \lg n \rceil$ .



An *n*-input, *w*-bit mux, tree implementation:

Constructed from 2-input multiplexors.

Illustration is for n = 8.

The path from the selected input ... ... is through  $\lceil \lg n \rceil$  2-input muxen ... ... through 3 for illustrated size, n = 8.

The number of muxen connected to select bit i...... is  $n/2^{i+1}$  for  $0 \ge i < \lceil \lg n \rceil ...$ ... for illustrated size 2 muxen connect to bit 1.

## Cost Computation

Total number of 2-input muxen ...

 $\dots \sum_{i=0}^{\lg n-1} n/2^{i+1} = n-1 \dots$ ... for illustrated mux, 7 2-input muxen.

Total cost: 3w(n-1).

## Total Delay: $2\lceil \lg n \rceil$ .





Delay Computation

XOR Gates: 2.

Reduction Tree:  $\lg w$ .

Total Delay:  $2 + \lg w \approx \lg w$ .

Binary Full Adder Implementations

Fast BFA

Cost Computation

Two XOR, 3 AND:  $2 \times 3 + 3 = 9$  units.





This delay is useful when **a** and **b** arrive earlier than **ci**.

See Path II and green labels in diagram.

Delay: 2 units.

## w-Bit Ripple Adder

Implementation using Fast BFA



Cost Computation: Cost of w BFAs: 9w units.

Delay Computation

See critical path (in red) in diagram.

Delay:  $2(w+1) \approx 2w$ .

Integer Magnitude Comparison

For comparisons like a < b.

Implementation:

Compute a - b and check whether result negative.

If carry out of MSB is 0 then a - b < 0 and so a < b is true.

Omit sum hardware in BFA, and replace remaining XOR with an OR.

Cost Computation

Each modified BFA now costs 4 units.

Total cost: 4w units.

Delay Computation

Delay is 3 for first bit, 2 for remaining bits.

Total delay:  $2w + 1 \approx 2w$ .

smod-14

## Cascaded Ripple Units

For computations using ripple units... ... such as a + b + e, and (a + b) < e, etc.

Cost Computation

Cost is sum of costs of each ripple unit.

For example, a + b + c is two ripple adders, cost is  $18w u_c \dots$ 

 $\dots (a+b) < e$  is a ripple adder plus a magnitude comparison:  $9w + 4w = 13w u_c$ .

### Cascaded Ripple Units

Delay Computation:

Consider (a+b) + e.

Naïve analysis: wait for a + b to finish, then start +e.

But, LSB of a + b available after only  $4 u_t \dots$ ... bit *i* is available after  $(4 + 2i) u_t \dots$ ... so the +e computation can start after  $4 u_t$ .

Delay for two ripple units is  $4 + 2(w + 1) u_t$ .

Delay for bit i at output of n ripple units is  $4(n-1) + 2(i+2)u_t$ .

Delay for n ripple units is  $4(n-1) + 2(w+1)u_t$ .

### Cost and Performance with Constant Inputs

### Constant Inputs

### Signal values which never change.

Cost and delay are radically different when an input never changes.

In Verilog, this might be an elaboration-time constant ... ... or other expressions that never change.

## Multiplexor Constant-Input Optimizations



Sample mux optimizations:

Costs: From top to bottom costs are: 1 unit, 1 unit, zero units.

Delays: From top to bottom delays are: 1 unit, 1 unit, zero units.