Publications, Manuscripts, and Reports

Selected publications are listed below by category including abstracts. "Manuscript" links point to gnuzipped PostScript and Portable Document Format (PDF) files. PDF material can be viewed using the free Adobe Acrobat Reader.

Branch Prediction, Pre-Execution


The Impact of Fetch Rate and Reorder Buffer Size On Speculative Pre-Execution

David M. Koppelman, Workshop on Duplicating, Deconstructing, and Debunking Held in conjunction with the 30th International Symposium on Computer Architecture June 8, 2003, pp. 64-75.

Pre-execution systems reduce the impact of cache misses and branch mispredictions by forking a slice, a code fragment derived from the program, in advance of frequently mispredicted branches and frequently missing loads in order to either resolve the branch or prefetch the load. Because unnecessary instructions are omitted the slice reaches the branch or load before the main thread does, for loads this time margin can reduce or even eliminate cache miss delay. Published results have shown significant improvements for some benchmarks, on the order of 20%, with many showing at least single-digit improvements. These studies left unexamined two system parameters that one would expect pre-execution to be sensitive to: fetch rate and reorder buffer size. Higher fetch rate would allow the main thread to reach the troublesome load sooner, but would not affect the slice and so the slice's margin is reduced. Studies have shown large potential margins for slices, but the fetch rate effect has not been measured. A second system parameter is reorder buffer size. A larger reorder buffer would allow a system to hide more of the miss latency that pre-execution reduces. To test the sensitivity to these factors pre-execution schemes were simulated on systems with varying fetch rates and reorder buffer sizes. Results show that higher fetch rate does not reduce pre-execution speedup in most benchmarks. Reorder buffer size sensitivity varies, some benchmarks are insensitive to reorder buffer size increases beyond 256 entries, but still benefit from pre-execution, the benefit due in large part to prefetching those loads that provide values for frequently mispredicted branches. The benchmarks that are sensitive to reorder buffer size are also the ones that benefit most from pre-execution.

Keywords: pre-execution

Manuscript (211 kB PDF) Presentation (813 kB PDF) Hide Abstracts


The Benefit of Multiple Branch Prediction On Dynamically Scheduled Systems

David M. Koppelman, Workshop on Duplicating, Deconstructing, and Debunking Held in conjunction with the 29th International Symposium on Computer Architecture May 26, 2002, pp. 42-51.

Providing instructions for wide-issue systems is a challenge because the execution path must be predicted before branches are even decoded. Since basic blocks can easily be smaller than the issue width of near future systems multiple branch prediction is important. An early solution to the problem is the multiple branch predictor (MBP) of Yeh, Marr, and Patt. A PC-indexed branch address cache (BAC) provides information on all basic blocks within a number, say 3, basic blocks of the PC; a path through the blocks is chosen by a global-history branch predictor. The published evaluation did not include overall performance numbers, a shortcoming shared by many subsequent studies of similar systems. To fill this gap dynamically scheduled processors using a MBP or a conventional mechanism were simulated using RSIM. For comparison a multiple branch predictor based on superblocks, described by Reinman, Austin, and Calder was also simulated. Compared to single branch prediction, 8-way superscalar MBP systems show speedup of 10% on some SPECcpu integer benchmarks on a resource-constrained system, the superblock predictor realizes an improvement of 8%. MBP tree depths of 2 and 3 show nearly equal performance. As one would expect, MBP systems fill more decode slots, but data show that much of the performance gain is due to providing more time for dependent integer instructions. Much smaller are gains due to early initiation of memory operations (except when using fast caches) and a less frequently empty reorder buffer.

Keywords: multiple branch prediction, CPU front end, superblock

Manuscript (159 kB PDF) Presentation (455 kB PDF) Hide Abstracts


Shared Memory


Neighborhood Prefetching On Multiprocessors Using Instruction History

David M. Koppelman, International Conference on Parallel Architectures and Compilation Techniques, October 2000, pp. 123-132.

A multiprocessor prefetch scheme is described in which a miss is followed by a prefetch of a group of lines, a neighborhood, surrounding the demand-fetched line. The neighborhood is based on the data address and the past behavior of the access instruction that missed the cache. A neighborhood for an instruction that misses the cache is constructed by recording the offsets of addresses that subsequently miss. This neighborhood prefetching can exploit sequential access as can sequential prefetch and can to some extent exploit stride access, as can stride prefetch. Unlike stride and sequential prefetch it can support irregular access patterns, for example, accessing some subset of a structure's members. Neighborhood prefetching was compared to adaptive sequential prefetching using execution-driven simulation. Results show more useful prefetches, and lower execution time, for neighborhood prefetching for six of eight SPLASH-2 benchmarks. On eight SPLASH-2 benchmarks the average normalized execution time is less than 0.9, for three benchmarks, less than 0.8.

Keywords: multiprocessor, prefetching, coherent-cache shared memory.

Manuscript (165 kB PDF) Longer Version (211 kB PDF) Presentation (272 kB PDF) Hide Abstracts


Using Predictions of Non-Regular Memory Access Patterns to Improve Multiprocessor Cache Performance (Neighborhood Prefetching)

David M. Koppelman, LSU ECE Department Seminar, March 1999

Computer engineers are suffering in a hell partly of their own creation: while the time needed to access memory has slowly dropped, the number of instructions that can be executed during that time has increased exponentially. Since execution must sometimes wait for the outcome of a memory access, efforts at improving instruction execution speed are increasingly futile. This memory access latency can be partly avoided by caching data in a small, expensive, high-speed memory. Memory latency is still a problem because needed data is not always in the cache. An aggressive solution to the problem is prefetching: bringing data into the cache before it is needed using predictions of memory access addresses. Prefetch schemes reported in the literature predict consecutive or arithmetic address sequences, for example, 1000, 1025, 1050, 1075, etc. A more flexible scheme is described here: a neighborhood is found for memory access instructions based on their past behavior. Address predictions are made using a neighborhood, these can form sequential or stride sequences, but can also form irregular sequences, increasing the number of useful prefetches and reducing the number of useless ones. This neighborhood prefetching was evaluated by execution-driven simulation of shared-memory parallel computers (multiprocessors) running SPLASH 2 benchmark programs. Performance improvements of up to 25% were obtained on neighborhood prefetching schemes having storage requirements of only 7% that of the cache itself.

Keywords: multiprocessor, prefetching, coherent-cache shared memory.

Manuscript (736 kB PDF) Hide Abstracts


Speculative Multiprocessor Cache Line Actions Using Instruction and Line History

David M. Koppelman , The 10th International Conference on Parallel & Distributed Computing Systems, New Orleans, Louisiana, October 1997, pp. 401-406.

A technique is described for reducing miss latency in coherent-cache shared-memory parallel computers. Miss latency is reduced by speculatively invalidating and updating (copying an exclusively held line back to memory) cache lines at one processor that might be needed at another processor. A line becomes a candidate for speculative invalidation when another line last accessed by the same instruction is invalidated. A line becomes a candidate for speculative updating under corresponding conditions. The technique can be implemented by constructing linked lists of lines for each memory access instruction. The amount of memory needed by an implementation is less than 50% the size of the cache, roughly comparable to the cost of adding error-correcting memory. No time need be added to cache hits. In execution-driven simulations of such systems running programs from the Splash 2 suite invalidations and updates are reduced by 50% or more. Total time spent servicing misses is reduced by about 20% on 16-processor systems, execution time is reduced as much as 20% for some benchmarks on high latency systems.

Keywords: multiprocessor, speculative invalidation, speculative updating, coherent-cache shared memory.

Manuscript (128 kB PDF) Presentation (225 kB PDF) Hide Abstracts


Speculative Multiprocessor Cache Line Actions Using Instruction History

David M. Koppelman Manuscript (223 kB PDF)

Hide Abstracts


A Multiprocessor Memory Processor for Efficient Sharing and Access Coordination

David M. Koppelman , Workshop on Mixing Logic and DRAM, International Symposium on Computer Architecture, June 1997

The growing disparity between instruction issue rates and memory access speed impacts multiprocessors especially hard under certain circumstances. To alleviate the problem a system is described here in which smart memory chips can execute simple operations so that certain tasks can be completed with less contention, fewer messages, or by avoiding synchronization that would otherwise be necessary. These operations, issued in only a few cycles by a CPU, direct the memory to read, modify, and write a memory location. Tag values can be used to delay completion of an operation until the needed tag value is set. Operations can have multiple steps, and can be sent between memories to complete. A completion counter at the issuing CPU can be used to wait until in-progress operations have completed. Demonstrating their usefulness, execution-driven simulation of such systems shows speedup of well over two times on code fragments chosen for their suitability. The radix sorting program from the SPLASH-2 benchmark shows over 29% reduction in execution time.

Keywords: smart memory, tagged memory, multiprocessors, latency tolerance

Manuscript (209 kB PDF) Presentation (319 kB PDF) Hide Abstracts


Reducing PE/Memory Traffic in Shared Memory Multiprocessors By the Difference Coding of Memory Addresses

David M. Koppelman , IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 11, pp. 1156-1168, November 1994

A method of reducing the volume of data flowing through the network in a shared memory parallel computer (multiprocessor) is described. The reduction is achieved by difference coding the memory addresses in messages sent between processing elements (PEs) and memories. In an implementation each PE would store the last address sent to each memory and vice versa. Messages that would normally contain an address instead contain the difference between the address associated with the current and most recent messages.

Trace driven simulation shows that only 70% or less of traffic volume (including data and overhead) is necessary, even in systems using coherent caches. The reduction in traffic could result in a lower cost or lower latency network. The cost of the hardware to achieve this is small, and the delay added is insignificant compared to network latency.

Keywords: address coding, difference coding, network traffic reduction, required bandwidth reduction, parallel processor, multiprocessor, shared memory, coherent caches

Manuscript (454 kB PDF) Hide Abstracts


Reducing PE/Memory Traffic in Shared Memory Multiprocessors By the Difference Coding of Memory Addresses

David M. Koppelman , in International Conference on Parallel Processing, August 1992, vol. I, pp. 53-56.

Hide Abstracts


Network Design


A Family of Networks for Non-Uniform Traffic

David M. Koppelman , IEEE Transactions on Parallel and Distributed Systems vol. 7, no. 5, pp. 486-492, May 1996

New networks, called GLO networks, are constructed by adding bus-like links to omega networks, providing additional capacity between cells on momentarily busy paths. Equivalent pin-count GLO and omega networks offered uniform and nonuniform traffic were simulated. GLO networks exhibited lower latency for nonuniform traffic and light to moderate uniform traffic.

Keywords: congestion, congestion resistance, interconnection networks, non-uniform traffic, omega networks, banyan networks

Manuscript (156 kB PDF) Hide Abstracts


A Self-Routing Permutation Network

David M. Koppelman and A. Yavuz Oruç , Journal of Parallel and Distributed Computing, vol. 10, no. 1, pp. 140-151, 1990'

A self-routing permutation network is a connector which can set its own switches to realize any one-to-one mapping of its inputs onto its outputs. Many permutation networks have been reported in the literature, but none with the self-routing property, except crossbars and cellular permutation arrays which have excessive cost. This paper describes a self-routing permutation network which has O(log³n) bit-level delay and uses O(n log³ n) bit-level hardware where n is the number of inputs to the network. The network is derived from a complementary Benes network by replacing each of its two switches in its first stage by what is called a 1-sorter, and recursively defining the switches in the third stage as self-routing networks. The use of 1-sorters results in substantial reduction in both propagation delay and hardware cost when contrasted with O(n) delay and O(n1.59) hardware of the recursively decomposed version of a complementary Benes network. Furthermore, these complexities match the propagation delay and hardware cost of Batchers sorters (only networks, other than crossbars and cellular permutation arrays, which are known to behave like self-routing permutation networks.) More specifically, it is shown that the network of this paper uses about half of the hardware with about four-thirds of the delay of a Batchers sorter.

Keywords: self routing, permutation network, interconnection network

Hide Abstracts


A Self-Routing Permutation Network

David M. Koppelman and A. Yavuz Oruç, in International Conference on Parallel Processing, August 1989, vol. I, pp. 288-295

Hide Abstracts


Network Analysis


Sticky States in Banyan Network Queues and Their Application to Analysis

David M. Koppelman , The Journal of Parallel and Distributed Computing, vol. 40, pp. 147-161, 1997

A finite-buffered banyan network analysis technique designed to model networks at high traffic loads is presented. The technique specially models two states of the network queues: queue empty and queue congested (roughly, zero or one slots free). A congested queue tends to stay congested because packets bound for the queue accumulate in the previous stage. The expected duration of this state is computed using a probabilistic model of a switching module feeding the congested queue. A technique for finding a lower arrival rate to an empty queue is also described. The queues themselves are modeled using independent Markov chains with an additional congested state added. The new analysis technique is novel in its modeling the higher arrival rate to a congested queue and a lower arrival rate to an empty queue. Comparison of queue state distributions obtained with the analysis and simulation show that an important feature of congestion is modeled.

Keywords: banyan network, queue state distribution, congestion

Manuscript (432 kB PDF) Hide Abstracts


A Lower Bound On the Average Physical Length of Edges in the Physical Realization of Graphs

David M. Koppelman , Parallel Processing Letters, vol. 6, no. 1, pp. 137-143, 1996

[Summary of introduction.] A graph realization is an assignment of positional and other information to the elements of a graph; it can be used to bound the performance of an interconnection network based on the graph. Vitanyi describes a graph realization where processors have bounded volume; a lower bound on the average physical edge length is derived for such graph realizations. The bound is based on the graphs diameter, symmetry, and path distribution. The bound is easily applied to many popular graphs such as trees, d-dimensional toroidal meshes, and cube-connected cycles. The bounds are not easy to apply to less-symmetric graphs, in particular the de Bruijn and shuffle-exchange graphs. A modified version of Vitanyi bound, easily applied to the de Bruijn and other graphs, is presented. Also presented is a more precise realization definition.

Keywords: graphs, realization, physical limits, edge length

Manuscript (112 kB PDF) Hide Abstracts


Congested Banyan Network Analysis Using Congested-Queue States and Neighboring-Queue Effects

David M. Koppelman , IEEE/ACM Transactions on Networking, vol. 4, no. 1, pp. 106-111, February 1996

A banyan network analysis technique is presented which more accurately models a congested network than other reported techniques. The analysis is based on the observation that a full queue (within the switching modules making up the network) causes traffic to back up. For a short time after becoming full the free space in the queue is limited to no more than one slot. A queue in such a condition is called congested. Because of blocking the arrival rate to a congested queue is higher; this tends to maintain congestion. The arrival rate to a congested queues dual is lower as is the service rate for queues feeding the congested queue. These effects are captured in the analysis. The state model used for a queue encodes congested as well as normal operation. Further, the model codes whether the connected next-stage and dual queues are congested. Network throughput computed with the model is closer to that obtained from simulations than other banyan analyses appearing in the literature, including those designed to model congestion. Further, the queue-occupancy distributions are much closer than other analyses, suggesting that the analysis better models conditions in congested banyan networks.

Keywords: banyan network, congestion, analytic model

Manuscript (130 kB PDF) Hide Abstracts


Congested Banyan Network Analysis Using Congested-Queue States and Neighboring-Queue Effects

David M. Koppelman, Technical Report, Louisiana State University, Baton Rouge, LA, September 1995 Manuscript (275 kB PDF)

Hide Abstracts


An Analysis of Banyan Networks Offered Traffic with Geometrically Distributed Message Lengths

I-Pyen Lyaw and David M. Koppelman , IEE Proceedings - Communications, vol. 142, no. 5, pp. 285-291, October 1995

An analysis of finite-input-buffered banyan networks offered traffic having geometrically distributed message lengths is presented. This is one of the few multistage-network analyses for networks offered non-unit-length messages and is the only one that the authors are aware of for finite-input-buffered banyan networks. In the analysis, network switching elements are modeled using two state-machines, one for queue heads (HOLs), the other for entire queues. A network is modeled using one switching-element model to represent each stage. Together these model the effect that non-unit-length messages have on banyans. Solutions are obtained iteratively. Network performance figures were obtained with this analysis and compared to simulation results. The figures show that the analysis can predict the effect of message length on throughput and delay, including the performance degradation caused by longer messages.

Keywords: banyan network, nonuniform message length traffic, analytic model

Manuscript (117 kB PDF) Hide Abstracts



ECE Home Page Up
David M. Koppelman - koppel@ece.lsu.edu
Modified 22 Dec 2004 12:36 (1836 UTC)