Case Study: NVIDIA GeForce 3 Series

Overview

Early programmable GPU.

Available 2001, discontinued.

Specifications (GeForce3 Ti 500)

Memory: 64 MiB

Bandwidth: 8 GB/s.

Programmable vertex processor (shader).
References

Description of GeForce 3 Vertex Processor Microarchitecture

Good technical description in top-tier graphics conference.

_Erik Lindholm, Mark J. Kilgard, Henry Moreton, “A User-Programmable Vertex Engine,”_ SIGGRAPH 2001, p. 149-

Product Overview


Slides describing GeForce3 with good coverage of instruction set.

Specification of Vertex Processor API

Ostensibly an API for programming, not the true set of machine instructions. However, Lindholm 2001 strongly implies that it is close to the true instruction set.

*NV_Vertex_Program specification*,
http://www.ece.lsu.edu/gp/refs/nv-vertex-program.txt
GeForce3 Major Units

Command and Data Fetch

Vertex Processor

Single Unit

Programmable

This unit described in detail here.

Primitive Assembly Setup

Texture Shader

Four Units

An important unit, but not covered in detail until good reference found.

Z-Test, Blend, Frame Buffer Update
Operating Modes

**Render Mode:**
GPU processing vertices as vertex attributes arrive from CPU.

The GPU is in render mode while processing a sequence of `glVertex` OpenGL commands.

**Setup Mode:**
GPU changing state (configuration) in response to non-vertex data from CPU.

Setup might be needed for change of:

- Transformation matrices.
- Vertex program.
- Lighting parameters.
Preliminaries: Quad Data Type

Quad Data Type

Just one data type, the **quad**.

**Quad:**
Set of four 32-bit FP numbers in IEEE 754 format, so total size is 128 bits.

*Format* follows IEEE 754 standard but arithmetic does not:

- Many arithmetic operations not done to full precision.
- No arithmetic exceptions.
- Just one rounding mode (not four).

$0 \times x = 0 \quad \forall x$ (including infinity and NaN).

No integer type (with one special-purpose exception).
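
The multiplication rule above differs from IEEE 754, where $0 \times \infty$ and $0 \times \text{NaN}$ both yield NaN. A minimal Python sketch of the rule follows; the function name `gf3_mul` is hypothetical, and the sketch ignores the sign of zero and the reduced-precision rounding of the real hardware.

```python
import math

def gf3_mul(a: float, b: float) -> float:
    """Multiply under the rule 0 * x = 0 for every x, including infinity and NaN."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

print(gf3_mul(0.0, math.inf))  # 0.0, where IEEE 754 gives NaN
print(gf3_mul(math.nan, 0.0))  # 0.0, where IEEE 754 gives NaN
print(gf3_mul(2.0, 3.0))       # 6.0
```

One consequence is that a term scaled by a zero coefficient drops out of a sum even when the other factor has overflowed.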
Data Type Rationale

Thirty-two bits sufficient for graphics.

Many graphics operations use 4-element vectors, including homogeneous coordinates and RGBA data.

True IEEE 754 arithmetic adds to cost but not to value (at least before GPGPU applications).
Preliminaries: Swizzling

Swizzling (Vector Element Rearrangement and Duplication)

**Swizzle:**
To rearrange or duplicate elements of a vector. For example, \((1, 2, 3, 4)\) can be swizzled to \((4, 2, 2, 3)\).

Swizzle Notation

Let \(R_1\) be the name of something that stores a quad.

The symbols \(x\), \(y\), \(z\), and \(w\) denote the four elements (\(x\) is first element, etc.).

Name followed by four letters (e.g., \(R_1.zyxx\)): rearrange as shown. *E.g.*, for \(R_1.zyxx\): \((1, 2, 3, 4) \longrightarrow (3, 2, 1, 1)\). (Note the duplication of \(x\).)

Vertex Assembly Notation: One letter (e.g., \(R_0.y\)): duplicate, equivalent to \(R_0.yyyy\). *E.g.*, \((1, 2, 3, 4) \longrightarrow (2, 2, 2, 2)\).

GL Shader Language Notation: Name followed by \(n\) letters, \(1 \le n \le 4\): vector of length \(n\) swizzled as shown. *E.g.*, let \(R_1 = (1, 2, 3, 4)\); then \(R_1.y = (2)\) (note the difference from vertex assembly notation).
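
The two notations can be compared with a short Python sketch. The function names `swizzle` and `swizzle_asm` are hypothetical; quads are modeled as plain tuples.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(quad, pattern):
    """GL Shader Language style: one result element per letter in pattern."""
    return tuple(quad[INDEX[c]] for c in pattern)

def swizzle_asm(quad, pattern):
    """Vertex assembly style: a single letter is replicated to all four elements."""
    if len(pattern) == 1:
        pattern = pattern * 4
    if len(pattern) != 4:
        raise ValueError("assembly swizzle takes one or four letters")
    return swizzle(quad, pattern)

r1 = (1, 2, 3, 4)
print(swizzle_asm(r1, "zyxx"))  # (3, 2, 1, 1)
print(swizzle_asm(r1, "y"))     # (2, 2, 2, 2)
print(swizzle(r1, "y"))         # (2,)
```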
GeForce 3 Vertex Attribute:
One of 16 quads describing some aspect of a vertex.

Attributes are numbered and each has a specific meaning.

Attribute 0 is the vertex coordinate, attribute 2 is normal, etc.

Attribute numbers are exposed to the APIs (OpenGL, Direct3D).

Attribute numbers are used as register numbers in several places.
Unit: Command and Data Fetch

In rendering mode, reads attributes from CPU.

Data from CPU in variety of formats (8-bit integer, 32-bit float, etc.) . . .

. . . and may not be full 4-element vectors.

Unit converts data to quads and writes to Vertex Attribute Buffer.

Missing array elements are initialized to 0 or 1.
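
A minimal sketch of the padding step, assuming the usual OpenGL convention that missing second and third elements default to 0 and a missing fourth element defaults to 1. The name `to_quad` and the `DEFAULTS` tuple are hypothetical, and the format conversion (for example from 8-bit integers) is not modeled.

```python
DEFAULTS = (0.0, 0.0, 0.0, 1.0)  # assumed fill values for missing elements

def to_quad(components):
    """Expand 1 to 4 supplied components into a full quad."""
    comps = tuple(float(c) for c in components)
    if not 1 <= len(comps) <= 4:
        raise ValueError("expected between one and four components")
    return comps + DEFAULTS[len(comps):]

print(to_quad([1, 2, 3]))  # (1.0, 2.0, 3.0, 1.0)
print(to_quad([0.5]))      # (0.5, 0.0, 0.0, 1.0)
```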

**Vertex Attribute Buffer (VAB):**

Set of 16 quad registers, each register corresponds to a vertex attribute.

Hardware implementation of command / data fetch unit not described.
Vertex Processor Overview

Purpose: Apply transform & lighting computations.

Operation: Read data from VAB, write to OB.

Implemented as very simple microprogrammed processor.
VP Registers

**Input Buffer (implements Vertex Attribute Registers):**
A set of 16 quad registers holding vertex attributes. These registers are read-only to the vertex processor. Each vertex processor has several input buffers.

Number of input buffers not available.

The number might have been chosen to match operation latency.

**Constant Registers (implements Program Parameter Registers):**
A set of 96 quad registers that are read-only to the vertex processor.

Constant registers do not change from vertex to vertex.

They hold data such as transformation matrices and lighting parameters.
VP Registers

**Temporary Registers:**
A set of 12 quad registers that can be read or written by vertex processor.

**Address Register:**
Effectively a single 32-bit integer register, but defined as a four-element vector of 32-bit integers. Can only be written by one instruction, ARL. Value can only be used for indexed addressing of constant (parameter) registers.

**Output Buffer (implements Vertex Result Registers):**
A set of 16 quad registers that are write only. Each VP has multiple output buffers.
### Vertex Attribute (Input Buffer) Register Names and Purpose (Table X.2)

<table>
<thead>
<tr>
<th>Register Number</th>
<th>Conventional Per-vertex Parameter</th>
<th>Conventional Per-vertex Parameter Command</th>
<th>Conventional Component Mapping</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>vertex position</td>
<td>Vertex</td>
<td>x,y,z,w</td>
</tr>
<tr>
<td>1</td>
<td>vertex weights</td>
<td>VertexWeightEXT</td>
<td>w,0,0,1</td>
</tr>
<tr>
<td>2</td>
<td>normal</td>
<td>Normal</td>
<td>x,y,z,1</td>
</tr>
<tr>
<td>3</td>
<td>primary color</td>
<td>Color</td>
<td>r,g,b,a</td>
</tr>
<tr>
<td>4</td>
<td>secondary color</td>
<td>SecondaryColorEXT</td>
<td>r,g,b,1</td>
</tr>
<tr>
<td>5</td>
<td>fog coordinate</td>
<td>FogCoordEXT</td>
<td>fc,0,0,1</td>
</tr>
<tr>
<td>6</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>7</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>8</td>
<td>texture coord 0</td>
<td>MultiTexCoord(GL_TEXTURE0_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>9</td>
<td>texture coord 1</td>
<td>MultiTexCoord(GL_TEXTURE1_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>10</td>
<td>texture coord 2</td>
<td>MultiTexCoord(GL_TEXTURE2_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>11</td>
<td>texture coord 3</td>
<td>MultiTexCoord(GL_TEXTURE3_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>12</td>
<td>texture coord 4</td>
<td>MultiTexCoord(GL_TEXTURE4_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>13</td>
<td>texture coord 5</td>
<td>MultiTexCoord(GL_TEXTURE5_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>14</td>
<td>texture coord 6</td>
<td>MultiTexCoord(GL_TEXTURE6_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
<tr>
<td>15</td>
<td>texture coord 7</td>
<td>MultiTexCoord(GL_TEXTURE7_ARB,...)</td>
<td>s,t,r,q</td>
</tr>
</tbody>
</table>
## Vertex Result (Output Buffer) Register Names and Purpose (Table X.1)

<table>
<thead>
<tr>
<th>Vertex Result Register Name</th>
<th>Description</th>
<th>Component Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>HPOS</td>
<td>Homogeneous clip space position</td>
<td>(x,y,z,w)</td>
</tr>
<tr>
<td>COL0</td>
<td>Primary color (front-facing)</td>
<td>(r,g,b,a)</td>
</tr>
<tr>
<td>COL1</td>
<td>Secondary color (front-facing)</td>
<td>(r,g,b,a)</td>
</tr>
<tr>
<td>BFC0</td>
<td>Back-facing primary color</td>
<td>(r,g,b,a)</td>
</tr>
<tr>
<td>BFC1</td>
<td>Back-facing secondary color</td>
<td>(r,g,b,a)</td>
</tr>
<tr>
<td>FOGC</td>
<td>Fog coordinate</td>
<td>(f,*,*,*)</td>
</tr>
<tr>
<td>PSIZ</td>
<td>Point size</td>
<td>(p,*,*,*)</td>
</tr>
<tr>
<td>TEX0</td>
<td>Texture coordinate set 0</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX1</td>
<td>Texture coordinate set 1</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX2</td>
<td>Texture coordinate set 2</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX3</td>
<td>Texture coordinate set 3</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX4</td>
<td>Texture coordinate set 4</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX5</td>
<td>Texture coordinate set 5</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX6</td>
<td>Texture coordinate set 6</td>
<td>(s,t,r,q)</td>
</tr>
<tr>
<td>TEX7</td>
<td>Texture coordinate set 7</td>
<td>(s,t,r,q)</td>
</tr>
</tbody>
</table>
**Vertex Attribute Buffer and Input Buffer**

Vertex attribute buffer (VAB) to input buffer (IB) transfer.

Data automatically copied from VAB to IB.

Transfer is triggered by a write to VAB attribute 0 (vertex position).

The 16 VAB registers are copied to the 16 registers of one of the IBs.

IB chosen in round-robin fashion.

Dirty bits used to avoid copying data that's unchanged.

Note automatic triggering of copy by write of attribute 0.
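
The transfer rules above can be sketched as a small Python model. The class name, the default of four input buffers, and the per-IB dirty-bit bookkeeping are all assumptions made for illustration; the source states only that dirty bits avoid copying unchanged data and that the IB is chosen round-robin.

```python
class VertexAttributeBuffer:
    """Toy model of the VAB-to-IB transfer (buffer count is hypothetical)."""

    def __init__(self, num_input_buffers=4):
        default = (0.0, 0.0, 0.0, 1.0)
        self.vab = [default] * 16
        self.ibs = [[default] * 16 for _ in range(num_input_buffers)]
        # dirty[i][a] is True when IB i holds a stale copy of attribute a.
        self.dirty = [[True] * 16 for _ in range(num_input_buffers)]
        self.next_ib = 0
        self.copies = 0  # number of quads actually copied

    def write(self, attr, quad):
        """Write one attribute; a write to attribute 0 triggers the copy."""
        self.vab[attr] = quad
        for flags in self.dirty:
            flags[attr] = True
        return self._transfer() if attr == 0 else None

    def _transfer(self):
        ib = self.next_ib
        for a in range(16):
            if self.dirty[ib][a]:  # skip attributes this IB already holds
                self.ibs[ib][a] = self.vab[a]
                self.dirty[ib][a] = False
                self.copies += 1
        self.next_ib = (ib + 1) % len(self.ibs)  # round-robin selection
        return ib

vab = VertexAttributeBuffer(num_input_buffers=2)
vab.write(3, (1.0, 0.0, 0.0, 1.0))         # color: no transfer yet
print(vab.write(0, (0.0, 0.0, 0.0, 1.0)))  # 0: position write fills IB 0
```

In this model a color written once stays in effect for later vertices, which matches the way OpenGL treats current color.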
VP Instruction Set Architecture

Instruction Sets

**True Instruction Set**

Instructions recognized by vertex processor hardware.

These are not documented, but are likely some kind of microinstructions.

**Exposed Instruction Set**

Instructions recognized by API calls.

Documented in OpenGL NV_Vertex_Program specification.

Lindholm 2001 implies close match to true instruction set.

NVIDIA-provided software translates exposed instruction set to true one.

Description here is of exposed instruction set.
Register Name Assembly Syntax

Based on output of NVIDIA compiler.

Input Buffer (Vertex Attribute) Register Names:

vertex_program notation: \textit{v}[0]-\textit{v}[15] or \textit{v}[\textit{OPOS}]-\textit{v}[\textit{TEX7}].

NVIDIA compiler: \textit{vertex.position}, \textit{vertex.normal}, etc.

Constant Register Names: \textit{c}[0]-\textit{c}[95].

Temporary Register Names: \textit{R0-R11}.

Output Buffer Register Names:

vertex_program notation: \textit{o}[0]-\textit{o}[15].

NVIDIA compiler: \textit{result.position}, \textit{result.color}, etc.

Example:
\textbf{MAD result.position, vertex.position.w, c[14], R0;}
VP Instruction Set Architecture

Instruction Sources

Instructions can have up to three register source operands:
MAD R1, R2, R3, R4;

Any source operand can read IB, temporary, or constant registers:
ADD R1, R2, R3; (Read temporary.)
ADD R1, R2, c[3]; (Read constant.)
ADD R1, R2, vertex.position; (Read input buffer.)

Any source operand can be arbitrarily swizzled:
ADD R1, R2.x, R3.wzyx; (Reverse order of last operand’s components.)

Any source operand can be negated:
ADD R0.y, R0, -R0.z;

Constant registers can be indexed using the address register, A0 (it indexes registers, not memory):
ADD R1, -R2, c[A0.x];

There are no immediates (instead, place constant in constant register).
Instruction Destinations

Any instruction can write temporary and output buffer registers.

Un-exposed instructions may be able to write constant memory.

Write can target any subset of components:

\texttt{DP3 R0.x, R0, R1;} (Leave \texttt{R0}'s \texttt{y}, \texttt{z}, and \texttt{w} unchanged.)
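
The source modifiers (swizzle, negate) and the destination write mask can be modeled in a few lines of Python. This is an illustrative sketch only; the helper names `read_source`, `write_dest`, `dp3`, and `add` are hypothetical, and registers are plain lists of four floats.

```python
IDX = {"x": 0, "y": 1, "z": 2, "w": 3}

def read_source(reg, swizzle="xyzw", negate=False):
    """Apply swizzle, then optional negation, to a source register."""
    if len(swizzle) == 1:          # one letter replicates that component
        swizzle = swizzle * 4
    vals = [reg[IDX[c]] for c in swizzle]
    return [-v for v in vals] if negate else vals

def write_dest(old, result, mask="xyzw"):
    """Write only the components named in mask; keep the rest."""
    return [result[i] if c in mask else old[i] for i, c in enumerate("xyzw")]

def dp3(a, b):
    """3-component dot product, replicated to all four components."""
    s = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return [s] * 4

def add(a, b):
    return [x + y for x, y in zip(a, b)]

R0 = [1.0, 2.0, 3.0, 4.0]
R1 = [4.0, 5.0, 6.0, 7.0]

# DP3 R0.x, R0, R1;
print(write_dest(R0, dp3(read_source(R0), read_source(R1)), mask="x"))
# [32.0, 2.0, 3.0, 4.0]

# ADD R0.y, R0, -R0.z;
print(write_dest(R0, add(read_source(R0), read_source(R0, "z", negate=True)), mask="y"))
# [1.0, -1.0, 3.0, 4.0]
```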
## VP Instruction Set Architecture

### Complete Instruction Set

From vertex_program Section 2.14.1.9:

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Inputs (scalar or vector)</th>
<th>Output (vector or replicated scalar)</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>ARL</td>
<td>s</td>
<td>address register</td>
<td>address register load</td>
</tr>
<tr>
<td>MOV</td>
<td>v</td>
<td>v</td>
<td>move</td>
</tr>
<tr>
<td>MUL</td>
<td>v,v</td>
<td>v</td>
<td>multiply</td>
</tr>
<tr>
<td>ADD</td>
<td>v,v</td>
<td>v</td>
<td>add</td>
</tr>
<tr>
<td>MAD</td>
<td>v,v,v</td>
<td>v</td>
<td>multiply and add</td>
</tr>
<tr>
<td>RCP</td>
<td>s</td>
<td>ssss</td>
<td>reciprocal</td>
</tr>
<tr>
<td>RSQ</td>
<td>s</td>
<td>ssss</td>
<td>reciprocal square root</td>
</tr>
<tr>
<td>DP3</td>
<td>v,v</td>
<td>ssss</td>
<td>3-component dot product</td>
</tr>
<tr>
<td>DP4</td>
<td>v,v</td>
<td>ssss</td>
<td>4-component dot product</td>
</tr>
<tr>
<td>DST</td>
<td>v,v</td>
<td>v</td>
<td>distance vector</td>
</tr>
<tr>
<td>MIN</td>
<td>v,v</td>
<td>v</td>
<td>minimum</td>
</tr>
<tr>
<td>MAX</td>
<td>v,v</td>
<td>v</td>
<td>maximum</td>
</tr>
<tr>
<td>SLT</td>
<td>v,v</td>
<td>v</td>
<td>set on less than</td>
</tr>
<tr>
<td>SGE</td>
<td>v,v</td>
<td>v</td>
<td>set on greater than or equal</td>
</tr>
<tr>
<td>EXP</td>
<td>s</td>
<td>v</td>
<td>exponential base 2</td>
</tr>
<tr>
<td>LOG</td>
<td>s</td>
<td>v</td>
<td>logarithm base 2</td>
</tr>
<tr>
<td>LIT</td>
<td>v</td>
<td>v</td>
<td>light coefficients</td>
</tr>
</tbody>
</table>
Instruction Descriptions

Selected instructions described below.

For descriptions of all instructions see vertex_program Section 2.14.1.10.
Instruction: **RCP** destination, source0

**Reciprocal**

```c
    t.x = source0.c;
    if (negate0) {t.x = -t.x;}
    if (t.x == 1.0f) {u.x = 1.0f;} else {u.x = 1.0f / t.x;}
    if (xmask) destination.x = u.x;
    if (ymask) destination.y = u.x;
    if (zmask) destination.z = u.x;
    if (wmask) destination.w = u.x;
```

Precision: $|u.x - \text{IEEE}(1.0/t.x)| < 2^{-22}$ for $1.0 \le t.x \le 2.0$.
Instruction: **EXP** destination, source0

**Exponential Base 2**

\begin{verbatim}
  t.x = source0.c;
  if (negate0) {t.x = -t.x;}
  q.x = 2^floor(t.x);
  q.y = t.x - floor(t.x);
  q.z = q.x * APPX(q.y);  // APPX(q.y) approximates 2^q.y
  if (xmask) destination.x = q.x;
  if (ymask) destination.y = q.y;
  if (zmask) destination.z = q.z;
  if (wmask) destination.w = 1.0;
\end{verbatim}

z component holds the approximate result; x and y hold the values needed to compute a more accurate result, since $2^{t.x} = q.x \times 2^{q.y}$.
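
A minimal Python sketch of the EXP result components follows. The hardware approximation APPX is not documented here, so `appx_exp2_frac` is a hypothetical stand-in: a quadratic that matches $2^f$ at $f = 0$ and $f = 1$ and is off by a few tenths of a percent in between.

```python
import math

def appx_exp2_frac(f):
    """Hypothetical stand-in for APPX: rough quadratic fit to 2**f on [0, 1)."""
    return 1.0 + f * (0.6565 + f * 0.3435)

def exp_insn(t):
    """Return the (x, y, z, w) components that EXP produces for scalar t."""
    qx = 2.0 ** math.floor(t)      # exact power of two
    qy = t - math.floor(t)         # fractional part, in [0, 1)
    qz = qx * appx_exp2_frac(qy)   # quick approximation of 2**t
    return (qx, qy, qz, 1.0)

x, y, z, w = exp_insn(2.5)
print(x, y)               # 4.0 0.5
print(z)                  # approximate value of 2**2.5
print(x * 2.0 ** y)       # more accurate value rebuilt from x and y
```

A program that needs more accuracy than z provides can apply its own, longer polynomial to y and scale the result by x.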
Vertex transformation only (no lighting).

Source Code (OpenGL Shader Language):

```glsl
  gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
```

Assembler Code (Output of NVIDIA compiler):

```assembly
  PARAM c[5] = { program.local[0],
                 state.matrix.mvp.transpose };

  TEMP R0;
  MUL R0, vertex.position.y, c[2];
  MAD R0, vertex.position.x, c[1], R0;
  MAD R0, vertex.position.z, c[3], R0;
  MAD result.position, vertex.position.w, c[4], R0;
  END
  # 4 instructions, 1 R-regs
```
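
The MUL/MAD sequence computes the matrix-vector product as a sum of matrix columns scaled by the position components, which is why the compiler loads the transposed matrix into `c[1]` through `c[4]`. A Python sketch of the same sequence, with hypothetical helper names and a translation matrix chosen only as an example:

```python
def mul(s, v):
    """MUL with a replicated scalar source: s * v."""
    return [s * e for e in v]

def mad(s, v, acc):
    """MAD with a replicated scalar source: s * v + acc."""
    return [s * e + a for e, a in zip(v, acc)]

def transform(columns, pos):
    """Mirror the four-instruction program; columns[i] plays the role of c[i+1]."""
    x, y, z, w = pos
    r0 = mul(y, columns[1])          # MUL R0, vertex.position.y, c[2];
    r0 = mad(x, columns[0], r0)      # MAD R0, vertex.position.x, c[1], R0;
    r0 = mad(z, columns[2], r0)      # MAD R0, vertex.position.z, c[3], R0;
    return mad(w, columns[3], r0)    # MAD result.position, vertex.position.w, c[4], R0;

# Example matrix: translate by (5, 6, 7).
M = [[1.0, 0.0, 0.0, 5.0],
     [0.0, 1.0, 0.0, 6.0],
     [0.0, 0.0, 1.0, 7.0],
     [0.0, 0.0, 0.0, 1.0]]
columns = [list(col) for col in zip(*M)]   # rows of the transpose

print(transform(columns, [1.0, 2.0, 3.0, 1.0]))  # [6.0, 8.0, 10.0, 1.0]
```

Four DP4 instructions against the untransposed rows would give the same result; the column form shown here is what the compiler emitted.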
Transformation and Lighting

Source Code (OpenGL Shader Language):

```glsl
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;

    vec4 vertex_e = gl_ModelViewMatrix * gl_Vertex;
    vec3 norm_e = gl_NormalMatrix * gl_Normal;
    vec4 light_pos = gl_LightSource[1].position;
    float phase_light = dot(norm_e, normalize(light_pos - vertex_e).xyz);
    float phase_user = dot(norm_e, -vertex_e.xyz);
    float phase = sign(phase_light) == sign(phase_user) ? abs(phase_light) : 0.0;
    const vec3 ambient = gl_LightSource[1].ambient.rgb;
    const vec3 diffuse = gl_LightSource[1].diffuse.rgb;
    vec4 new_color;
    new_color.rgb = gl_Color.rgb * ( phase * diffuse + ambient );
    new_color.a = gl_Color.a;
    gl_FrontColor = new_color;
    gl_BackColor = gl_Color;
```
Assembler Code (Output of NVIDIA compiler):

```assembly
PARAM c[15] = { { 0 },
    state.matrix.modelview.transpose,
    state.matrix.modelview.inverse.row[0..2],
    state.light[1].ambient,
    state.light[1].diffuse,
    state.light[1].position,
    state.matrix.mvp.transpose };

TEMP R0; TEMP R1; TEMP R2;
MUL R0, vertex.position.y, c[2];
MAD R0, vertex.position.x, c[1], R0;
MAD R0, vertex.position.z, c[3], R0;
MAD R2, vertex.position.w, c[4], R0;
ADD R1, -R2, c[10];
DP4 R0.w, R1, R1;
RSQ R0.w, R0.w;
MUL R0.xyz, vertex.normal.y, c[6];
MAD R0.xyz, vertex.normal.x, c[5], R0;
MAD R0.xyz, vertex.normal.z, c[7], R0;
MUL R1.xyz, R0.w, R1;
DP3 R0.w, R0, -R2;
DP3 R0.x, R0, R1;
SLT R0.y, R0.w, c[0].x;
SLT R0.z, c[0].x, R0.w;
ADD R0.w, R0.z, -R0.y;
SLT R0.z, R0.x, c[0].x;
SLT R0.y, c[0].x, R0.x;
ADD R0.y, R0, -R0.z;
ADD R0.y, R0, -R0.w;
ABS R0.y, R0;
SGE R0.y, c[0].x, R0;
ABS R0.y, R0;
ABS R0.x, R0;
SGE R0.y, c[0].x, R0;
MAD R1.x, -R0, R0.y, R0;
MUL R0, vertex.position.y, c[12];
MUL R1.xyz, R1.x, c[9];
MAD R0, vertex.position.x, c[11], R0;
ADD R1.xyz, R1, c[8];
MAD R0, vertex.position.z, c[13], R0;
MUL result.color.xyz, vertex.color, R1;
MAD result.position, vertex.position.w, c[14], R0;
MOV result.color.back, vertex.color;
MOV result.color.w, vertex.color;
END
# 35 instructions, 3 R-regs
```
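
With no branch instructions available, the compiler builds `sign()`, the equality test, and the `?:` selection from SLT, SGE, ABS, ADD, and MAD. The Python sketch below follows that part of the listing on scalars; the function names are hypothetical and the register allocation is not modeled.

```python
def slt(a, b):
    """SLT: 1.0 if a < b, else 0.0."""
    return 1.0 if a < b else 0.0

def sge(a, b):
    """SGE: 1.0 if a >= b, else 0.0."""
    return 1.0 if a >= b else 0.0

def sign(v):
    """Two SLTs and an ADD: yields -1.0, 0.0, or 1.0."""
    return slt(0.0, v) - slt(v, 0.0)

def phase(phase_light, phase_user):
    """Branch-free form of: signs equal ? abs(phase_light) : 0.0."""
    diff = abs(sign(phase_light) - sign(phase_user))  # ADD, ABS
    equal = sge(0.0, diff)          # SGE: 1.0 when the signs match
    differ = sge(0.0, abs(equal))   # ABS, SGE: 1.0 when the signs differ
    a = abs(phase_light)            # ABS
    return -a * differ + a          # MAD: a * (1 - differ)

print(phase(0.5, 2.0))    # 0.5
print(phase(-0.5, -1.0))  # 0.5
print(phase(0.5, -1.0))   # 0.0
```

The comparison instructions produce 0.0 or 1.0, so a selection becomes a multiply by a flag.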
VP Instruction Set Architecture

Instruction Set Design Choices

Based on analysis of fixed-functionality vertex processing code:

Used about 50% of time: MOV, MUL, ADD, MAD

Used about 40% of time: DP3, DP4.

RCP: Instead of divide because it’s faster.

RSQ: Within 1.5 bits of IEEE precision.
VP Microarchitecture

Register sets listed above.

Instruction memory has room for 128 instructions.

Executes at rate of one instruction per cycle.

200 MHz clock.

Two functional units.
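
The figures above give a quick upper bound on vertex throughput: at one instruction per cycle, a program of $n$ instructions can finish at most $200 \times 10^6 / n$ vertices per second. A small Python sketch of that arithmetic, assuming the vertex processor is the only bottleneck (which is an idealization):

```python
CLOCK_HZ = 200e6  # 200 MHz clock, one instruction per cycle

def peak_vertex_rate(instructions_per_vertex):
    """Upper bound on vertices per second for a program of the given length."""
    return CLOCK_HZ / instructions_per_vertex

print(peak_vertex_rate(4) / 1e6)    # 50.0 million vertices/s (transform only)
print(peak_vertex_rate(35) / 1e6)   # about 5.7 million vertices/s (transform and lighting)
```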
Functional Units:

Two exposed functional units (SIMD, Special).

SIMD Vector Unit

Three source operands.

MOV, MUL, ADD, MAD, DP3, DP4, DST, MIN, MAX, SLT, SGE

Special Functional Unit

Single source operand.

RCP, RSQ, LOG, EXP, LIT

Possible additional units for fixed-function use.

All instructions have same latency.
Program Sequencing

In setup mode:

- Program loaded to program memory.
- Constants loaded into constant registers.

In render mode:

- Program run for particular IB/OB pair.
- Program starts each time an IB fills.
- Program completion signals primitive assembly unit to proceed.
- Execution multithreaded.
Program Execution

Assumed Stages (Timing and number of stages unknown, $\mu$-insn fetch omitted):

- RR: Register Read.
- SN: Swizzle and Negate.
- Ei: Execute stage i. Execution likely takes multiple cycles and is fully pipelined.
- WB: Writeback.

Multithreaded execution is used in GeForce 3.
Single Thread (Not Multithreaded) Execution

A design option **not used** for GeForce 3.

Finish data from one IB before starting another.

Consider a pair of dependent instructions:

```
ADD r1, c[2], v[3]       RR SN E1 E2 WB
MUL o[4], r1, c[5]          RR ----> SN E1 E2 WB
```

**MUL** stalls two cycles waiting for result of **ADD**.

In GF3 the number of stall cycles would be higher since there are more **Ei** stages.

+ Just need one $\mu$PC and one set of temporary registers.

- Multi-cycle stalls.

- To avoid stalls need bypass paths or scheduling opportunities.
Multithreaded Execution

Used in GeForce 3 (and most if not all modern GPUs).

Work on data from several input buffers simultaneously.

Each thread accesses data from one input buffer.

Let $t_i$ denote thread $i$.

Thread $i$ has its own set of temporary registers and $\mu$PC.

Thread $i$ reads $IB_i$ registers, writes output buffer $i$ registers.
Same pair of dependent instructions as last example.

Five threads active.

<table>
<thead>
<tr>
<th>Cycle</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>Note</th>
</tr>
</thead>
<tbody>
<tr>
<td>t0: ADD r1, c[2], v[3]</td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>v[3] in IB 0, r1 in set 0</td>
</tr>
<tr>
<td>t1: ADD r1, c[2], v[3]</td>
<td></td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>v[3] in IB 1, r1 in set 1</td>
</tr>
<tr>
<td>t2: ADD r1, c[2], v[3]</td>
<td></td>
<td></td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>t3: ADD r1, c[2], v[3]</td>
<td></td>
<td></td>
<td></td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>t4: ADD r1, c[2], v[3]</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>t0: MUL o[4], r1, c[5]</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>RR</td>
<td>SN</td>
<td>E1</td>
<td>E2</td>
<td>WB</td>
<td>Likewise for t1-t4 in the following cycles</td>
</tr>
</tbody>
</table>

+ No stalls.

+ No bypass paths needed.

- Need multiple sets of temporary registers.

The number of IBs was presumably chosen to cover execution latency.
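
The effect of thread count on stalls can be checked with a small issue-slot simulation in Python. This is a sketch under stated assumptions: every instruction depends on the previous one in its thread, a dependent instruction may issue no sooner than `min_distance` cycles after its producer, and `min_distance = 3` is chosen only because it reproduces the two-cycle stall of the single-thread example. The function name and parameters are hypothetical.

```python
def simulate(num_threads, insns_per_thread, min_distance):
    """Issue one instruction per cycle when some thread is ready.

    Returns (issue cycles used, stall cycles).
    """
    ready_at = [0] * num_threads               # earliest issue cycle per thread
    remaining = [insns_per_thread] * num_threads
    cycle = stalls = 0
    while any(remaining):
        ready = [t for t in range(num_threads)
                 if remaining[t] and ready_at[t] <= cycle]
        if ready:
            t = min(ready, key=lambda i: ready_at[i])  # longest-waiting thread
            remaining[t] -= 1
            ready_at[t] = cycle + min_distance
        else:
            stalls += 1                        # no thread can issue this cycle
        cycle += 1
    return cycle, stalls

print(simulate(1, 2, 3))  # (4, 2): single thread, MUL stalls two cycles
print(simulate(5, 2, 3))  # (10, 0): five threads, no stalls
```

In this model stalls disappear once the number of threads reaches `min_distance`, which is the sense in which the number of input buffers covers execution latency.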
Vertex Processor Design Factors

Exploits vertex program code characteristics:

No memory access: no memory port.

Small program size: tiny program memory.

Limited purpose: specialized instructions.

Vertex independence: easy multithreaded execution.

Repeated execution: data-triggered sequencing.
Vertex Processors in More Recent GPUs

Limited control-transfer instructions (branching).

Access to memory.

Features carefully controlled to preserve multithreading and simplify memory access.