LSU EE 7700-3 -- Advanced Computer Implementation -- Spring 2005

Lecture Set 1
Dataflow Limit and Course Outline

This set is incomplete; most material was covered in class on the blackboard.

====================

Wall 93 (ILP limit study)

  Can achieve 20-100 IPC.

Techniques Within the Dataflow Limit

Techniques Outside of the Dataflow Limit


Explicit Parallelism

  Ordinary Program: One thread (a PC and a set of registers).

  Parallel Program: Multiple threads (can be run on multiple processors).

Parallel Program Execution

  Start as an ordinary program.
  Create a new thread (by forking or other means).
    The parent thread creates a child thread (or children).
    Children can also fork, creating more threads.
  Parent and child threads cooperate on the problem.
  Occasionally one must wait for another:
    use some kind of synchronization command.
  (A minimal fork/join sketch appears at the end of this set.)

  Tricky Parts
    Finding good fork points.
    Synchronizing properly.

Parallel Programming Models

  Shared Memory
    Communication through shared memory (reading and writing arrays, say).
      Thread 1:  a[17] = x + y;
      Thread 2:  z = a[17];

  Message Passing
    Communication by sending messages.
      Thread 1:  message_send(2, x + y);
      Thread 2:  z = message_receive();
    (A concrete message-passing sketch appears at the end of this set.)

Parallel Organizations

  SMT (Simultaneous Multithreading), marketed as Hyper-Threading (HT):
    A single core switches between fetching from different threads.
    Instructions from different threads are simultaneously in flight.
    Not a true parallel machine because some things are not done in parallel.

  Multiprocessor:
    Multiple processors that share memory.
    The caches must deal with writes from multiple processors to the same location.
    Expensive when there are more than about four processors.
    Examples:
      Common dual-processor PCs.
      Sol: an eight-processor Sun machine.

  Cluster:
    Multiple processors that do not share memory.
    Communication is over a network (possibly high speed).
    Inexpensive.
    Example:
      SuperMike: 1024 processors.

Explicit Parallelism v. ILP

  ILP: Runs existing code unmodified.
  EP:  Potentially much more efficient and faster.

Efficiency v. Speed

  Speed, meaning:
    Time needed to run a program.  (Low time = high speed.)

  Efficiency, meaning:
    Speed / area, or
    energy needed to run a program.

  Single processor: faster means less efficient.
    More time spent fetching down the wrong path.
    More hardware needed to manage out-of-order execution.

    Efficient:   Five-stage MIPS.  Zero branch penalty.
    Inefficient: Superscalar, dynamically scheduled.  Large branch penalty.

Efficiency v. Parallelism

  Efficiency (one measure):
    instructions in serial execution / instructions in parallel execution.
  Efficiency drops as the number of processors increases.
  (A worked example appears at the end of this set.)

  High-Efficiency Parallel Programs
    Little communication, little duplicated work.
    Example: Eureka problems, such as breaking a cipher by trying all keys.

  Moderate-Efficiency Parallel Programs
    Some communication and synchronization overhead.
    Some duplicated work.
    Example: finite-element programs.

Efficiency v. Speed and Parallel Computers

  If it were easy to write parallel programs,
    computers would use many simple CPUs.
  But it's not easy.
  Most parallel computers use inefficient but fast processors,
    even though the programs have already been parallelized,
    because inefficiency in the processor is preferable to the
    inefficiency of parallelizing the program further.
  A few specialized parallel computers use many efficient processors.
    These are difficult to program.

Up To Now

  It has been easier to build faster implementations than to parallelize
  code (for most programs).  This may change.
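
Fork/Join Sketch (for Parallel Program Execution above)

  A minimal sketch of the fork/join pattern using POSIX threads.  The work
  split and the routine name child_work are invented for illustration; only
  pthread_create and pthread_join are standard calls.

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical work routine run by the child thread. */
    void *child_work(void *arg)
    {
      int *half = arg;
      /* ... compute on the second half of the data ... */
      printf("Child working on half %d\n", *half);
      return NULL;
    }

    int main(void)
    {
      pthread_t child;
      int second_half = 2;

      /* Fork point: parent creates a child thread. */
      pthread_create(&child, NULL, child_work, &second_half);

      /* Parent works on the first half of the data in parallel. */
      /* ... */

      /* Synchronization: parent waits for the child to finish. */
      pthread_join(child, NULL);

      return 0;
    }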
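
Message-Passing Sketch (for Parallel Programming Models above)

  The message_send / message_receive calls in the notes are generic
  placeholders.  One concrete instance is MPI; the sketch below assumes a
  run with at least two ranks and sends one integer from rank 0 to rank 1
  using the standard MPI_Send and MPI_Recv calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int rank, x = 17, y = 5, z;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
        int sum = x + y;
        /* Like message_send(1, x + y): send one int to rank 1, tag 0. */
        MPI_Send(&sum, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
        /* Like z = message_receive(): receive one int from rank 0. */
        MPI_Recv(&z, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", z);
      }

      MPI_Finalize();
      return 0;
    }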
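
Worked Efficiency Example (for Efficiency v. Parallelism above)

  Numbers below are invented for illustration.

    Serial run:               1.0e9 instructions.
    8-processor parallel run: 1.25e9 instructions total
                              (extra work from communication,
                               synchronization, and duplication).

    Efficiency = 1.0e9 / 1.25e9 = 0.8

    With 64 processors the overhead grows, say to 2.0e9 instructions
    total, so efficiency falls to 1.0e9 / 2.0e9 = 0.5, illustrating why
    efficiency drops as the number of processors increases.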