LSU EE 7700-3 -- Advanced Computer Implementation -- Spring 2005
Lecture Set 1
Dataflow Limit and Course Outline
This set is incomplete; most material was covered in class on the blackboard.
====================
Wall 93 (study of the limits of instruction-level parallelism):
Can achieve 20-100 IPC under idealized assumptions.
Techniques Within Dataflow Limit
Techniques Outside of Dataflow Limit
Explicit Parallelism
Ordinary Program:
One thread. (PC, set of registers).
Parallel Program:
Multiple Threads (Can be run on multiple processors.)
Parallel Program Execution
Start as an ordinary program.
Create a new thread. (By forking or other means.)
Parent thread creates child thread (or children).
Children can also fork, creating more threads.
Parent and child threads cooperate on problem.
Occasionally one must wait for another:
Use some kind of synchronization command.
Tricky Parts
Finding good fork points.
Synchronizing properly. (A minimal fork/join sketch follows.)
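As a concrete illustration of a fork point and a synchronization, here is a minimal POSIX-threads sketch (the names child_work, child_sum, and parent_sum are made up for this example; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    /* Work assigned to the child thread. */
    void *child_work(void *arg)
    {
      int *sum = arg;
      for ( int i = 0; i < 1000; i++ ) *sum += i;   /* Child's share. */
      return NULL;
    }

    int main(void)
    {
      pthread_t child;
      int child_sum = 0, parent_sum = 0;

      /* Fork point: parent creates a child thread. */
      pthread_create(&child, NULL, child_work, &child_sum);

      /* Parent works on its share while the child runs. */
      for ( int i = 1000; i < 2000; i++ ) parent_sum += i;

      /* Synchronization: parent waits for the child to finish. */
      pthread_join(child, NULL);

      printf("Total: %d\n", parent_sum + child_sum);
      return 0;
    }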
Parallel Programming Models
Shared Memory
Communication by shared memory (reading and writing arrays, say).
Thread 1: a[17] = x + y;
Thread 2: z = a[17];
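A slightly fuller sketch of the two-line example above, assuming pthreads; a join is used here so that Thread 1's write is guaranteed to be visible before Thread 2 reads (other synchronization methods would also work):

    #include <pthread.h>
    #include <stdio.h>

    int a[32];              /* Shared: visible to both threads. */
    int x = 3, y = 4, z;

    void *thread_1(void *arg)
    {
      a[17] = x + y;        /* Thread 1 communicates by writing shared memory. */
      return NULL;
    }

    int main(void)          /* Main thread plays the role of Thread 2. */
    {
      pthread_t t1;
      pthread_create(&t1, NULL, thread_1, NULL);
      pthread_join(t1, NULL);     /* Wait until the write has happened. */
      z = a[17];            /* Thread 2 communicates by reading shared memory. */
      printf("z = %d\n", z);
      return 0;
    }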
Message Passing
Communication by sending messages.
Thread 1: message_send(2,x+y);
Thread 2: z = message_receive();
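The message_send and message_receive calls above are pseudocode; a minimal sketch of the same exchange using MPI, a real message-passing library, assuming the program is launched with at least three ranks so that ranks 1 and 2 exist (matching the thread numbers above):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int rank, x = 3, y = 4, z;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if ( rank == 1 )
        {
          int msg = x + y;
          /* Rank 1 sends one int to rank 2, with message tag 0. */
          MPI_Send(&msg, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        }
      else if ( rank == 2 )
        {
          /* Rank 2 receives the int from rank 1 into z. */
          MPI_Recv(&z, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("z = %d\n", z);
        }

      MPI_Finalize();
      return 0;
    }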
Parallel Organizations
SMT (Simultaneous Multithreading; Intel's Hyper-Threading, HT, is an example):
Single core switches between fetching different threads.
Instructions from different threads simultaneously in flight.
Not a true parallel machine because the threads share fetch and other resources, so some things are not done in parallel.
Multiprocessor:
Multiple processors that share memory.
Caches must handle writes from multiple processors to the same location (cache coherence).
Expensive when more than about four processors.
Examples:
Common dual-processor PCs.
Sol: Eight-processor Sun machine.
Cluster
Multiple processors that do not share memory.
Communication over a network (possibly a high-speed one).
Inexpensive.
Example:
SuperMike: 1024 processors.
Explicit Parallelism v. ILP
ILP: Runs existing code unmodified.
EP: Potentially much more efficient and faster, but code must be written (or rewritten) as parallel.
Efficiency v. Speed
Speed Meaning:
Time needed to run a program. (Less time = higher speed.)
Efficiency Meaning (two possible measures):
Speed / Area.
Energy needed to run a program. (Less energy = more efficient.)
Single Processor: Faster means less efficient.
More time spent fetching down the wrong (mispredicted) path.
More hardware needed to manage out-of-order execution.
Efficient: Five-stage MIPS, zero branch penalty.
Inefficient: Superscalar, dynamically scheduled, large branch penalty.
Efficiency v. Parallelism
Efficiency (one measure): instructions executed in serial execution / instructions executed in parallel execution (summed over all processors).
Efficiency drops as the number of processors increases, because more instructions go to communication, synchronization, and duplicated work.
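A worked example of this measure, using made-up instruction counts: suppose a program executes 1,000,000 instructions when run serially, and the parallel version run on 8 processors executes 1,250,000 instructions in total (the extra 250,000 being communication, synchronization, and duplicated work). Then

    efficiency = 1,000,000 / 1,250,000 = 0.8

On 32 processors the overhead might grow the total to 2,000,000 instructions, dropping efficiency to 0.5.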
High Efficiency Parallel Programs
Little communication, little duplicated work.
Example: Eureka problems, such as breaking a cipher by trying all keys.
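A sketch of why such problems parallelize so well: each processor can search its own portion of the key space with essentially no communication. The OpenMP version below is hypothetical; try_key stands in for whatever test decides that a key decrypts the ciphertext correctly:

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical test; here a stand-in that matches one arbitrary key. */
    int try_key(long key) { return key == 123456; }

    int main(void)
    {
      long found = -1;

      /* Loop iterations are divided among the processors; only the single
         matching iteration writes to found. */
      #pragma omp parallel for
      for ( long key = 0; key < (1L << 24); key++ )
        if ( try_key(key) ) found = key;

      printf("Key found: %ld\n", found);
      return 0;
    }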
Moderate Efficiency
Some communication and synchronization overhead. Some duplicated work.
Example: Finite-element programs.
Efficiency v. Speed and Parallel Computers
If it were easy to write parallel programs:
Computers would use many simple CPUs.
But it's not easy.
Most parallel computers use inefficient but fast processors,
even though the programs they run are already parallelized,
because inefficiency in the processor is preferable to the inefficiency of spreading the work across still more processors.
A few specialized parallel computers use many efficient processors.
These are difficult to program.
Up To Now
It has been easier to build faster implementations than to parallelize code (for most programs).
This may change.