LSU EE 7700-3 -- Advanced Computer Implementation -- Spring 2005

Lecture Set 1

Dataflow Limit and Course Outline

 This set is incomplete; most of the material was covered in class on the blackboard.

====================

Wall 93 (study of the limits of instruction-level parallelism):
 Under idealized assumptions (the dataflow limit), 20-100 IPC can be achieved.

Techniques Within Dataflow Limit

Techniques Outside of Dataflow Limit

Explicit Parallelism

 Ordinary Program: 
   One thread. (PC, set of registers).

 Parallel Program: 
   Multiple threads (can be run on multiple processors).

Parallel Program Execution
 Start as an ordinary program.
 Create a new thread, by forking or other means (see the sketch after this list).
   Parent thread creates child thread (or children).
   Children can also fork, creating more threads.
 Parent and child threads cooperate on problem.
 Occasionally one must wait for another:
   Use some kind of synchronization command.

 Tricky Parts
   Finding good fork points.
   Synchronizing properly.
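
 A minimal sketch of fork and join using POSIX threads.  Pthreads is one
 concrete way to create threads; the array, the even split of the work,
 and the names child, sum_lo, sum_hi are illustrative, not from the notes.

   /* Parent forks one child thread; each sums half of a shared array;
      the parent joins (waits for) the child before combining results.
      Compile with: gcc -pthread sketch.c                            */
   #include <pthread.h>
   #include <stdio.h>

   #define N 1000
   static double a[N];
   static double sum_lo, sum_hi;   /* Shared memory: both threads see these. */

   static void *child(void *arg)   /* Child thread: sums the upper half. */
   {
       for ( int i = N/2; i < N; i++ ) sum_hi += a[i];
       return NULL;
   }

   int main(void)
   {
       for ( int i = 0; i < N; i++ ) a[i] = i;

       pthread_t tid;
       pthread_create(&tid, NULL, child, NULL);        /* Fork point.        */

       for ( int i = 0; i < N/2; i++ ) sum_lo += a[i]; /* Parent: lower half. */

       pthread_join(tid, NULL);    /* Synchronization: wait for the child.   */

       printf("sum = %f\n", sum_lo + sum_hi);
       return 0;
   }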

Parallel Programming Models
 Shared Memory
  Communication by shared memory (reading and writing arrays, say).
   Thread 1: a[17] = x + y;
   Thread 2: z = a[17];
 Message Passing
  Communication by sending messages (see the MPI sketch after this list).
   Thread 1: message_send(2,x+y);
   Thread 2: z = message_receive();
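
 The message-passing snippet above written with MPI, one concrete
 message-passing library.  A sketch only: ranks 0 and 1 play threads 1
 and 2, and the values of x and y are made up.

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* Which process am I? */

       if ( rank == 0 ) {                       /* Thread 1 */
           double x = 1.0, y = 2.0, msg = x + y;
           MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       } else if ( rank == 1 ) {                /* Thread 2 */
           double z;
           MPI_Recv(&z, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
           printf("z = %f\n", z);
       }

       MPI_Finalize();
       return 0;
   }

 Run with, for example, "mpirun -np 2 ./a.out".  The send names its
 destination rank and the receive names its source rank, so no shared
 memory is needed.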

Parallel Organizations

 SMT (Simultaneous Multithreading; Intel's Hyper-Threading, HT, is one example):
  Single core switches between fetching different threads.
  Instructions from different threads simultaneously in flight.
  Not a true parallel machine because some things not done in parallel.

 Multiprocessor:
  Multiple processors, but share memory.
  Caches must be kept coherent: a write by one processor to a location must become visible to the other processors.
  Expensive when more than about four processors.
  Examples: 
    Common dual-processor PCs.
    Sol:  Eight-processor Sun machine.

 Cluster
  Multiple processors, do not share memory.
  Communication by network (maybe high speed).
  Inexpensive.
  Example:
    SuperMike: 1024 processors.

Explicit Parallelism v. ILP
  ILP: Runs existing (serial) code unmodified.
  EP: Potentially much more efficient and faster, but the code must be parallelized.

Efficiency v. Speed

 Speed Meaning
   Time needed to run program. (Low time = high speed)

 Efficiency Meaning (two common measures):
   Speed per unit of chip area.
   Energy needed to run the program (less energy = more efficient).

 Single Processor: Faster means less efficient.
   More time spent fetching wrong path.
   More hardware needed to manage out-of-order execution.

 Efficient: Five stage MIPS.  Zero branch penalty.
 Inefficient: Superscalar, dynamically scheduled.  Large branch penalty.

Efficiency v. Parallelism
 Efficiency (one measure): insn executed in serial run / total insn executed in parallel run.
 Efficiency drops as number of processors increases.
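
 A hypothetical worked example of this measure (all numbers made up for
 illustration): a serial run executes 1.0e9 instructions; an 8-processor
 run of the same program executes 1.25e9 instructions in total because of
 duplicated work and synchronization code.  Then
   efficiency = 1.0e9 / 1.25e9 = 0.80.
 Adding processors typically adds more such overhead instructions, pushing
 the ratio lower.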

 High Efficiency Parallel Programs
   Little communication, little duplicated work.
   Example: Eureka problems, such as breaking a cipher by trying all keys.

 Moderate Efficiency
   Some communication and synchronization overhead. Some duplicated work.
    Example: Finite-element programs.


Efficiency v. Speed and Parallel Computers
 If it were easy to write parallel programs:
   Computers would use many simple CPUs.
   But it's not easy.

 Most parallel computers use inefficient but fast processors,
   even though the programs they run are already parallelized,
   because the inefficiency of a fast processor is preferable to the
   inefficiency of spreading the work across still more processors.

 A few specialized parallel computers use many efficient processors.
   These are difficult to program.

 Up To Now
   Easier to build faster implementations than to parallelize code (for most programs).
   This may change.