GPU Road Network Graph Contraction and SSSP Query
Roozbeh Karimi, David M. Koppelman, and Chris J. Michael,
Proceedings of the ACM International Conference on Supercomputing, June 2019.
PHAST is to date one of the fastest algorithms for performing
single source shortest path (SSSP) queries on road-network
graphs. PHAST operates on graphs produced in part using Geisberger's
contraction hierarchy (CH) algorithm. Producing these graphs is
time consuming, limiting PHAST's usefulness when graphs are not
available in advance. CH iteratively assigns scores to nodes,
contracts (removes) the highest-scoring node, and adds shortcut edges
to preserve distances. Iteration stops when only one node
remains. Scoring and contraction rely on a
witness path search
(WPS) of nearby nodes. Little work has been reported on parallel and
especially GPU CH algorithms. This is perhaps due to issues such as
the validity of simultaneous potentially overlapping searches, score
staleness, and parallel graph updates.
A GPU contraction algorithm, CUCH, is presented which
overcomes these difficulties by partitioning the graph into levels
composed of independent sets of nodes (non-adjacent nodes) with
similar scores. This allows contracting multiple nodes simultaneously
with little coordination between threads. A GPU-efficient WPS is
presented in which a small neighborhood is kept in shared memory and a
hash table is used to detect path overlap. Low-parallelism regions of
contraction and query are avoided by halting contraction early and
computing all-pairs shortest paths (APSP) on the remaining graph. A PHAST-like query computes
SSSP using this contracted graph. Contraction of some DIMACS road
network graphs on an Nvidia P100 GPU achieves a speedup of 20 to
37 over Geisberger's serial code on a Xeon E5-2640 v4. Query times
on CUCH- and CH-contracted graphs were comparable.
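The score/contract/shortcut loop described above can be illustrated with a minimal serial sketch (not CUCH's GPU formulation). The edge-difference score and the witness_path_exists helper below are simplified, hypothetical stand-ins for Geisberger-style node scoring and the witness path search (WPS):

```python
import heapq

def witness_path_exists(graph, u, w, limit, excluded):
    # Dijkstra search from u toward w, skipping the node being
    # contracted; a path of length <= limit is a "witness" that
    # makes the shortcut u-w unnecessary.
    dist = {u: 0}
    pq = [(0, u)]
    while pq:
        d, x = heapq.heappop(pq)
        if d > limit:
            return False
        if x == w:
            return True
        if d > dist.get(x, float('inf')):
            continue
        for y, wt in graph[x].items():
            if y == excluded:
                continue
            nd = d + wt
            if nd <= limit and nd < dist.get(y, float('inf')):
                dist[y] = nd
                heapq.heappush(pq, (nd, y))
    return False

def score(graph, v):
    # Simplified score: shortcuts that contracting v would add,
    # minus the edges removed (the "edge difference" heuristic).
    shortcuts = 0
    nbrs = list(graph[v].items())
    for i, (u, du) in enumerate(nbrs):
        for w, dw in nbrs[i + 1:]:
            if not witness_path_exists(graph, u, w, du + dw, v):
                shortcuts += 1
    return shortcuts - len(nbrs)

def contract_all(graph):
    # graph: node -> {neighbor: weight}, undirected. Iterate until one
    # node remains, contracting the highest-scoring node and adding
    # shortcuts that preserve shortest-path distances.
    order = []
    while len(graph) > 1:
        v = max(graph, key=lambda n: score(graph, n))
        nbrs = list(graph[v].items())
        for i, (u, du) in enumerate(nbrs):
            for w, dw in nbrs[i + 1:]:
                if not witness_path_exists(graph, u, w, du + dw, v):
                    d = min(graph[u].get(w, float('inf')), du + dw)
                    graph[u][w] = d
                    graph[w][u] = d
        for u, _ in nbrs:
            del graph[u][v]
        del graph[v]
        order.append(v)
    return order
```

CUCH's contribution is parallelizing this loop: nodes are grouped into levels of non-adjacent, similarly scored nodes so that a whole level can be contracted at once.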
Thoroughly Exploring GPU Buffering Options for Stencil Code by Using
An Efficiency Measure and a Performance Model
Yue Hu, David M. Koppelman, and Steven R. Brandt,
IEEE Transactions on Multi-Scale Computing Systems,
vol. 4, no. 3, pp. 477-490, July-September 2018.
Stencil computations form the basis for computer simulations across
almost every field of science, such as computational fluid dynamics,
data mining, and image processing. Their mostly regular data access
patterns potentially enable them to take advantage of the high
computation and data bandwidth of GPUs, but only if data buffering and
other issues are handled properly.
Finding a good code generation strategy presents a number of
challenges, one of which is the best way to make use of memory. GPUs
have several types of on-chip storage including registers, shared
memory, and a read-only cache. The choice of type of storage and how
it's used, a buffering strategy, for each stencil array
(grid function, GF) requires a good understanding not only of
its stencil pattern but also of the efficiency of each type of storage
for the GF, to avoid squandering storage that would be more beneficial
to another GF. For a stencil computation with N GFs, the total
number of possible assignments is β^N, where β is the
number of buffering strategies. Our code-generation framework supports
five buffering strategies (β=5). Large, complex stencil kernels
may consist of dozens of GFs, resulting in significant search
overhead.
In this work, we present an analytic performance model for stencil
computations on GPUs and study the behavior of read-only cache and L2
cache. Next, we propose an efficiency-based assignment algorithm which
operates by scoring a change in buffering strategy for a GF using a
combination of (a) the predicted execution time and (b) on-chip
storage usage. By using this scoring, an assignment for N GFs can be
determined in (β-1)N(N+1)/2 steps. Results show that the
performance model has good accuracy and that the assignment strategy
is highly efficient.
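The (β-1)N(N+1)/2 step count follows from a greedy loop: each round scores the β-1 alternative strategies for every still-undecided GF, commits the best-scoring change (if it helps), and fixes that GF. A minimal sketch, with score_change as a hypothetical stand-in for the model's combined time/storage scoring:

```python
def assign_strategies(gfs, strategies, score_change):
    # strategies[0] is the baseline; score_change(gf, s, assignment)
    # returns the efficiency of switching gf to s (higher is better).
    assignment = {gf: strategies[0] for gf in gfs}
    undecided = list(gfs)
    steps = 0
    while undecided:
        best_gf, best_s, best_score = None, None, float('-inf')
        for gf in undecided:
            for s in strategies[1:]:          # beta - 1 alternatives
                steps += 1
                sc = score_change(gf, s, assignment)
                if sc > best_score:
                    best_gf, best_s, best_score = gf, s, sc
        if best_score > 0:                    # apply only improvements
            assignment[best_gf] = best_s
        undecided.remove(best_gf)             # this GF is now fixed
    return assignment, steps                  # steps = (beta-1)N(N+1)/2
```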
A Performance Model and Efficiency-Based Assignment of
Buffering Strategies for Automatic GPU Stencil Code Generation
Yue Hu, David M. Koppelman, Steven R. Brandt,
2016 IEEE 10th International Symposium on
Embedded Multicore/Many-core Systems-on-Chip (MCSOC), pp. 361-368, 2016.
Stencil computations form the basis for computer simulations across
almost every field of science, such as computational fluid dynamics,
data mining, and image processing. Their mostly regular data access
patterns potentially enable them to take advantage of the high
computation and data bandwidth of GPUs, but only if data buffering and
other issues are handled properly.
Finding a good code generation strategy presents a number of challenges, one of
which is the best way to make use of memory. GPUs have three types of
on-chip storage: registers, shared memory, and read-only cache. The
choice of type of storage and how it's used, a buffering
strategy, for each stencil array (grid function, GF) requires
a good understanding not only of its stencil pattern but also of
the efficiency of each type of storage for the GF, to avoid
squandering storage that would be more beneficial to another GF.
Our code-generation framework supports five buffering strategies. For
a stencil computation with N GFs, the total number of possible
assignments is 5^N. Large, complex stencil kernels may consist of
dozens of GFs, resulting in significant search overhead.
In this work, we present an analytic performance model for stencil
computations on GPUs, and study the behavior of read-only cache and L2
cache. Next, we propose an efficiency-based assignment algorithm,
which operates by scoring a change in buffering strategy for a GF
using a combination of (a) the predicted execution time and (b)
on-chip storage usage. By using this scoring, an assignment for N GFs
and b strategy types can be determined in (b-1)N(N+1)/2 steps.
Results show that the performance model has good accuracy and that the
assignment strategy is highly efficient.
GeauxDock: a Novel Approach for Mixed-Resolution Ligand Docking
Using a Descriptor-Based Force Field
Yun Ding, Ye Fang, Wei P. Feinstein, Jagannathan Ramanujam,
David M. Koppelman, Juana Moreno, Michal Brylinski, and Mark Jarrell,
Journal of Computational Chemistry,
vol. 36, no. 27, pp. 2013-2026, October 15, 2015.
Chemora: a PDE Solving Framework for Modern HPC Architectures
Erik Schnetter, Marek Blazewicz, Steven R. Brandt, David M. Koppelman, and Frank Löffler,
Computing in Science and Engineering, vol. 17, no. 2, pp. 53-64, March/April 2015.
Modern HPC architectures consist of heterogeneous multi-core,
many-node systems with deep memory hierarchies. Modern applications
employ ever more advanced discretisation methods to study
multi-physics problems. Developing such applications that explore
cutting-edge physics on cutting-edge HPC systems has become a
complex task that requires significant HPC knowledge and experience.
Unfortunately, this combined knowledge is currently out of reach for all
but a few groups of application developers.
Chemora is a framework for solving systems of Partial Differential
Equations (PDEs) that targets modern HPC architectures. Chemora is
based on Cactus, which sees prominent usage in the computational
relativistic astrophysics community. In Chemora, PDEs are expressed
either in a high-level LaTeX-like language or in Mathematica.
Discretisation stencils are defined separately from equations, and
can include Finite Differences, Discontinuous Galerkin Finite
Elements (DGFE), Adaptive Mesh Refinement (AMR), and multi-block
systems.
We use Chemora in the Einstein Toolkit to implement the Einstein
Equations on CPUs and on accelerators, and study astrophysical
systems such as black hole binaries, neutron stars, and
core-collapse supernovae.
Discovering Barriers to Efficient Execution,
Both Obvious and Subtle,
Using Instruction-Level Visualization
David M. Koppelman and Chris J. Michael,
First
Workshop on Visual Performance Analysis,
New Orleans, Louisiana, USA, November 2014.
CPU performance is determined by the interaction between available
resources, microarchitectural features, the execution of instructions,
and by the data. These elements can interact in complex ways, making
it difficult for those seeing only aggregate performance numbers, such
as miss ratios and issue rates, to determine whether there are
reasonable avenues for performance improvement.
A technique called instruction-level visualization helps users connect
these disparate elements by showing the timing of the execution of
individual program instructions.
The PSE visualization program enhances instruction-level
visualization by showing which instructions contribute to execution
inefficiency in a way that makes it easy to locate dependent
instructions and the history of events affecting the instruction. A
simple annotation system makes it easy for a user to attach custom
information. PSE has been used for microarchitecture research,
simulator debugging, and for instructional use.
From Physics Model to Results: An Optimizing Framework for
Cross-Architecture Code Generation
Marek Blazewicz, Ian Hinder, David M. Koppelman, Steven R. Brandt, Milosz Ciznick, Michal Kierzynka, Frank Löffler, Erik Schnetter, and Jian Tao,
Scientific Programming, vol. 21, pp. 1-16, 2013.
Starting from a high-level problem description in terms of
partial differential equations using abstract tensor notation, the
Chemora framework discretizes, optimizes, and generates complete high
performance codes for a wide range of compute architectures. Chemora
extends the capabilities of Cactus, facilitating the usage of
large-scale CPU/GPU systems in an efficient manner for complex
applications, without low-level code tuning. Chemora achieves
parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient
loop traversal strategies, dynamically selected data and instruction
cache usage strategies, and JIT compilation of GPU code tailored to
the problem characteristics. The discretization is based on
higher-order finite differences on multi-block domains. Chemora's
capabilities are demonstrated by simulations of black hole
collisions. This problem provides an acid test of the framework, as
the Einstein equations contain hundreds of variables and thousands of
terms.
The Interaction and Relative Effectiveness of Hardware and Software
Data Prefetch
S. Verma and David M. Koppelman,
Journal of Circuits, Systems, and Computers
(special issue on Interaction between Compilers and
Computer Architectures), vol. 21, no. 2, pp. 1-21, 2012.
A major performance limiter in modern processors is the long latencies
caused by data cache misses. Both compiler and hardware based
prefetching schemes help hide these latencies and so improve
performance. Compiler techniques infer memory access patterns through
code analysis, and insert appropriate prefetch instructions. Hardware
prefetching techniques work independently from the compiler by
monitoring an access stream, detecting patterns in this stream and
issuing prefetches based on these patterns. This paper looks at the
interplay between compiler-based and hardware-based prefetching
techniques. Does either technique make the other unnecessary?
First, compilers' ability to achieve good results without extreme
expertise is evaluated by preparing binaries with no prefetch,
one-flag prefetch (no tuning), and expertly tuned prefetch. From runs
of SPECcpu2006 binaries, we find that expertise avoids minor slowdown
in a few benchmarks and provides substantial speedup in several.
We compare software schemes to hardware prefetching schemes and our
simulations show software alone substantially outperforms hardware
alone on about half of a selection of benchmarks. In other cases
hardware matches or exceeds software. Analysis reveals that in many
cases hardware is not prefetching access patterns that it is capable
of recognizing, due to some irregularities in the observed miss
sequence. Hardware outperforms software on address sequences that the
compiler would not guess. Using the two schemes together provides
further benefits, but less than the sum of the contributions of each
alone.
A Penalty-Sensitive Branch Predictor
Yue Hu, David M. Koppelman, Lu Peng,
The Journal of Instruction-Level Parallelism, 2nd
JILP Workshop on Computer Architecture Competitions (JWAC-2):
Championship Branch Prediction, held in conjunction with The 38th
International Symposium on Computer Architecture, June 2011.
Branch predictor design typically focuses only on
minimizing the misprediction rate (MR), while ignoring the
misprediction penalty. Because the misprediction penalty
varies widely from branch to branch, performance might
be improved by using a predictor that makes a greater
effort to predict high-penalty branches, at the expense of
the others, even if the total number of mispredictions
doesn't change.
A penalty-sensitive predictor was developed based
on this idea. It includes a penalty predictor to predict
whether a branch is high or low penalty. Then, a two-class
TAGE predictor is developed to favor high-penalty
branches at the expense of low-penalty branches.
Experiments show that although the overall performance
improvement is limited, the penalty-sensitive
mechanism successfully decreases the MR of the high-penalty
branches while increasing the MR of the low-penalty
branches by only a small amount.
Presentation (680 kB Powerpoint)
Efficient Prefetching with Hybrid Schemes and Use of Program
Feedback to Adjust Prefetcher Aggressiveness
Santhosh Verma, David M. Koppelman, and Lu Peng,
The Journal of Instruction-Level Parallelism, vol. 13, pp. 1-14, January 2011.
A set of hybrid and adaptive prefetching schemes are considered in this
paper. The prefetchers are hybrid in that they use combinations of
Stride, Sequential and C/DC prefetching mechanisms. The schemes are
adaptive in that their aggressiveness is adjusted based on feedback
metrics collected dynamically during program execution. Metrics such
as prefetching accuracy and prefetching timeliness are used to vary
aggressiveness in terms of prefetch distance (how far ahead of the
current miss it fetches) and prefetch degree (the number of prefetches
issued). The scheme is applied separately at both the L1 and L2 cache
levels. We previously proposed a Hybrid Adaptive prefetcher for the
Data Prefetching Competition (DPC-1) which uses a hybrid PC-based
Stride/Sequential prefetcher. In this work, we also consider hybrid
schemes that use the CZone with Delta Correlations (C/DC)
prefetcher. Further, we break down the components of hybrid adaptive
prefetching by evaluating the individual benefits of using hybrid
schemes and using adaptivity. While hybrid prefetching improves
performance over a single prefetcher, we demonstrate that the main
benefit of adaptivity is the reduction in the number of prefetches
without hurting performance. For a set of 18 SPEC CPU 2006 benchmarks,
the best hybrid adaptive scheme achieves an average (geometric mean)
CPI improvement of 35% compared to no prefetching and 19% compared to
a basic Stride prefetcher. Adaptivity makes it possible to reduce the
number of prefetches issued by as much as 50% in some benchmarks
without sacrificing performance.
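The adaptivity described above can be sketched as interval-based feedback: accuracy (useful/issued) and timeliness counters steer the prefetch distance and degree. The thresholds and limits below are illustrative assumptions, not the paper's tuned values:

```python
class AdaptivePrefetcher:
    def __init__(self):
        self.distance, self.degree = 4, 2       # current aggressiveness
        self.issued = self.useful = self.late = 0

    def record(self, was_useful, was_late=False):
        # Called per prefetch once its usefulness/timeliness is known.
        self.issued += 1
        if was_useful:
            self.useful += 1
            if was_late:
                self.late += 1

    def adjust(self):
        # Called at the end of each sampling interval.
        if self.issued == 0:
            return
        accuracy = self.useful / self.issued
        lateness = self.late / max(self.useful, 1)
        if accuracy > 0.75 and lateness > 0.5:
            # Accurate but late: fetch further ahead, issue more.
            self.distance = min(self.distance * 2, 64)
            self.degree = min(self.degree + 1, 8)
        elif accuracy < 0.40:
            # Inaccurate: back off to cut useless prefetches.
            self.distance = max(self.distance // 2, 1)
            self.degree = max(self.degree - 1, 1)
        self.issued = self.useful = self.late = 0
```

The same feedback loop can be instantiated separately at L1 and L2, as in the paper.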
The Effects On Branch Prediction When Utilizing Control Independence
Chris J. Michael and David M. Koppelman,
The IEEE International Parallel & Distributed Processing Symposium PhD Forum,
April 2010.
Though current general-purpose processors have several small CPU cores
as opposed to a single more complex core, many algorithms and
applications are inherently sequential and so hard to explicitly
parallelize. Cores designed to handle these problems may exhibit
deeper pipelines and wider fetch widths to exploit instruction-level
parallelism via out-of-order execution. As these parameters increase,
so does the number of instructions fetched along an incorrect path
when a branch is mispredicted. Some instructions are fetched
regardless of the direction of a branch. In current conventional CPUs,
these instructions are always squashed upon branch misprediction and
are fetched again shortly thereafter. Recent research efforts explore
lessening the effect of branch mispredictions by retaining these
instructions when squashing or fetching them in advance when
encountering a branch that is difficult to predict. Though these
control independent processors are meant to lessen the damage of
misprediction, an inherent side-effect of fetching out of order,
branch weakening, reduces realized speedup and is in part responsible
for lowering potential speedup. This study formally defines and works
towards identifying the causes of branch weakening. The overall goal
of the research is to determine how much weakening is avoidable and
develop techniques to help reduce weakening in control independent
processors.
Spotlight--a Low Complexity Highly Accurate Profile-Based Branch Predictor
Santhosh Verma, Benjamin Maderazo and David M. Koppelman,
in the proceedings of the 28th IEEE International
Performance Computing and Communications Conference, Phoenix, AZ,
December 2009.
In an effort to achieve the high prediction accuracy needed to attain
high instruction throughputs, branch predictors proposed in the
literature and used in real systems have become increasingly more
complex and larger over time. This is not consistent with the
anticipated trend of simpler and more numerous cores in future
multi-core processors. We introduce the Spotlight Branch predictor, a
novel profile-based predictor which is able to achieve high prediction
accuracy despite its simple design. Spotlight achieves high accuracy
because complex decisions in the prediction process are made during an
OS managed, one-time profile run instead of using complex hardware. We
show that Spotlight achieves higher accuracy than Gshare as well as
highly accurate and implementable predictors such as YAGS and the
Hybrid Bimodal-Gshare predictor. It achieves an average reduction in
misprediction rate of 20% over Gshare, 11% over Elastic History
Buffer, 14% over YAGS, and 10% over Hybrid for a hardware budget of 8
kB. Spotlight is also compared to two difficult to implement neural
predictors, the Path-based Neural and the Hashed Perceptron. It
outperforms the Path-based Neural predictor at all sizes and the
Hashed Perceptron at smaller hardware budgets. These results
demonstrate that a simple profile-based predictor can achieve many of
the benefits of more complex predictors. We also show that a single
cycle latency implementation of Spotlight can be achieved without
sacrificing accuracy by using an upstream replacement scheme.
Memory Hierarchy Performance Measurement of Commercial
Dual-Core Desktop Processors
L. Peng, J-K Peir, T. K. Prakash, C. Staelin, Y-K. Chen, and D. Koppelman,
Journal of Systems Architecture,
vol 54, no. 8, pp. 816-828, Aug. 2008.
As the chip multiprocessor (CMP) has become the mainstream in processor
architectures, Intel and AMD have introduced their dual-core
processors. In this paper, performance measurements on an Intel Core 2
Duo, an Intel Pentium D, and an AMD Athlon 64 X2 processor are
reported. According to the design specifications, key deviations
exist in the critical memory hierarchy architecture among these
dual-core processors. In addition to the overall execution time and
throughput measurements using both multi-programmed and multi-threaded
workloads, this paper provides detailed analysis on the memory
hierarchy performance and on the performance scalability between
single and dual cores. Our results indicate that for better
performance and scalability, it is important to have (1) fast
cache-to-cache communication, (2) large L2 or shared capacity, (3)
fast L2-to-core latency, and (4) fair cache resource sharing. The three
dual-core processors that we studied have shown the benefits of some of
these factors, but not all of them. Core 2 Duo has the best
performance for most of the workloads because of its microarchitecture
features such as the shared L2 cache. Pentium D shows the worst
performance in many aspects because it is a technology remap of the
Pentium 4 that does not take advantage of on-chip communication.
The Impact of Fetch Rate and Reorder Buffer Size On
Speculative Pre-Execution
David M. Koppelman,
Workshop on Duplicating,
Deconstructing, and Debunking Held in conjunction with the 30th
International Symposium on Computer Architecture
June 8, 2003, pp. 64-75.
Pre-execution systems reduce the impact of cache misses and branch
mispredictions by forking a slice, a code fragment derived from the
program, in advance of frequently mispredicted branches and frequently
missing loads in order to either resolve the branch or prefetch the
load. Because unnecessary instructions are omitted, the slice reaches
the branch or load before the main thread does; for loads, this time
margin can reduce or even eliminate cache miss delay.
Published results have shown significant improvements for some
benchmarks, on the order of 20%, with many showing at least
single-digit improvements. These studies left unexamined two system
parameters that one would expect pre-execution to be sensitive to:
fetch rate and reorder buffer size. Higher fetch rate would allow the
main thread to reach the troublesome load sooner, but would not affect
the slice and so the slice's margin is reduced. Studies have shown
large potential margins for slices, but the fetch rate effect has not
been measured. A second system parameter is reorder buffer size. A
larger reorder buffer would allow a system to hide more of the miss
latency that pre-execution reduces.
To test the sensitivity to these factors pre-execution schemes were
simulated on systems with varying fetch rates and reorder buffer
sizes. Results show that higher fetch rate does not reduce
pre-execution speedup in most benchmarks. Reorder buffer size
sensitivity varies: some benchmarks are insensitive to reorder buffer
size increases beyond 256 entries but still benefit from
pre-execution, the benefit due in large part to prefetching those
loads that provide values for frequently mispredicted branches. The
benchmarks that are sensitive to reorder buffer size are also the ones
that benefit most from pre-execution.
Presentation (813 kB PDF)
The Benefit of Multiple Branch Prediction On Dynamically Scheduled
Systems
David M. Koppelman,
Workshop on Duplicating,
Deconstructing, and Debunking Held in conjunction with the 29th
International Symposium on Computer Architecture
May 26, 2002, pp. 42-51.
Providing instructions for wide-issue systems is a challenge because
the execution path must be predicted before branches are even decoded.
Since basic blocks can easily be smaller than the issue width of
near-future systems, multiple branch prediction is important.
An early solution to the problem is the multiple branch predictor
(MBP) of Yeh, Marr, and Patt. A PC-indexed branch address cache (BAC)
provides information on all basic blocks within a number, say 3, basic
blocks of the PC; a path through the blocks is chosen by a
global-history branch predictor.
The published evaluation did not include overall performance numbers,
a shortcoming shared by many subsequent studies of similar systems.
To fill this gap dynamically scheduled processors using a MBP or a
conventional mechanism were simulated using RSIM. For comparison
a multiple branch predictor based on superblocks, described by
Reinman, Austin, and Calder was also simulated.
Compared to single branch prediction, 8-way superscalar MBP systems
show a speedup of 10% on some SPECcpu integer benchmarks on a
resource-constrained system; the superblock predictor realizes an
improvement of 8%. MBP tree depths of 2 and 3 show nearly equal
performance. As one would expect, MBP systems fill more decode slots,
but data show that much of the performance gain is due to providing
more time for dependent integer instructions. Much smaller are gains
due to early initiation of memory operations (except when using fast
caches) and a less frequently empty reorder buffer.
Presentation (455 kB PDF)
Neighborhood Prefetching On Multiprocessors Using Instruction History
David M. Koppelman,
International
Conference on Parallel Architectures and
Compilation Techniques,
October 2000, pp. 123-132.
A multiprocessor prefetch scheme is described in which a miss is
followed by a prefetch of a group of lines, a neighborhood,
surrounding the demand-fetched line. The neighborhood is based on the
data address and the past behavior of the access instruction that
missed the cache. A neighborhood for an instruction that misses the
cache is constructed by recording the offsets of addresses that
subsequently miss. This neighborhood prefetching can exploit
sequential access, as sequential prefetch can, and can to some extent
exploit stride access, as stride prefetch can. Unlike stride and
sequential prefetch, it can support irregular access patterns, for
example, accessing some subset of a structure's members.
Neighborhood prefetching was compared to adaptive sequential
prefetching using execution-driven simulation. Results show more
useful prefetches, and lower execution time, for neighborhood
prefetching for six of eight SPLASH-2 benchmarks. On eight SPLASH-2
benchmarks the average normalized execution time is less than 0.9,
for three benchmarks, less than 0.8.
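The neighborhood mechanism described above can be sketched as follows; table sizes, the learning window, and the Python dictionaries are illustrative stand-ins for the hardware's instruction-history tables:

```python
from collections import defaultdict, deque

class NeighborhoodPrefetcher:
    def __init__(self, window=4, max_offsets=8):
        self.neighborhoods = defaultdict(set)   # PC -> set of line offsets
        self.recent = deque(maxlen=window)      # recent (PC, line) misses
        self.max_offsets = max_offsets

    def on_miss(self, pc, line):
        # Learn: record this miss as an offset from recent misses, so
        # each recent instruction's neighborhood grows to cover it.
        for prev_pc, prev_line in self.recent:
            hood = self.neighborhoods[prev_pc]
            if len(hood) < self.max_offsets:
                hood.add(line - prev_line)
        self.recent.append((pc, line))
        # Predict: prefetch the learned neighborhood around the
        # demand-fetched line, based on this PC's past behavior.
        return sorted(line + off
                      for off in self.neighborhoods[pc] if off != 0)
```

Because offsets need not be constant or consecutive, the same table covers sequential, strided, and irregular patterns such as struct-member accesses.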
Longer Version (211 kB PDF)
Presentation (272 kB PDF)
Using Predictions of Non-Regular Memory Access Patterns
to Improve Multiprocessor Cache Performance (Neighborhood Prefetching)
David M. Koppelman,
LSU ECE Department Seminar, March 1999.
Computer engineers are suffering in a hell partly of their own creation:
while the time needed to access memory has slowly dropped, the number
of instructions that can be executed during that time has increased
exponentially. Since execution must sometimes wait for the outcome of
a memory access, efforts at improving instruction execution speed are
increasingly futile. This memory access latency can be partly avoided
by caching data in a small, expensive, high-speed memory.
Memory latency is still a problem because needed data is not always in
the cache.
An aggressive solution to the problem is prefetching: bringing data
into the cache before it is needed using predictions of memory access
addresses. Prefetch schemes reported in the literature predict
consecutive or arithmetic address sequences, for example, 1000, 1025,
1050, 1075, etc. A more flexible scheme is described here: a
neighborhood is found for memory access instructions based on their
past behavior. Address predictions are made using a neighborhood; these can form
sequential or stride sequences, but can also form irregular sequences,
increasing the number of useful prefetches and reducing the number of
useless ones.
This neighborhood prefetching was evaluated by execution-driven
simulation of shared-memory parallel computers (multiprocessors)
running SPLASH 2 benchmark programs. Performance improvements
of up to 25% were obtained on neighborhood prefetching schemes
having storage requirements of only 7% that of the cache itself.
Speculative Multiprocessor Cache Line Actions Using Instruction and Line History
David M. Koppelman,
The
10th International Conference on Parallel & Distributed Computing Systems,
New Orleans, Louisiana, October 1997, pp. 401-406.
A technique is described for reducing miss latency in coherent-cache
shared-memory parallel computers. Miss latency is reduced by
speculatively invalidating and updating (copying an exclusively held
line back to memory) cache lines at one processor that might be needed
at another processor.
A line becomes a candidate for speculative invalidation when another
line last accessed by the same instruction is invalidated. A line
becomes a candidate for speculative updating under corresponding
conditions.
The technique can be implemented by constructing linked lists of lines
for each memory access instruction. The amount of memory needed by an
implementation is less than 50% of the size of the cache, roughly
comparable to the cost of adding error-correcting memory. No time
need be added to cache hits. In execution-driven simulations of such
systems running programs from the SPLASH-2 suite, invalidations and
updates are reduced by 50% or more. Total time spent servicing
misses is reduced by about 20% on 16-processor systems; execution
time is reduced by as much as 20% for some benchmarks on high-latency
systems.
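The per-instruction line lists can be illustrated behaviorally as below; the structure names are hypothetical and Python containers stand in for the hardware's linked lists:

```python
from collections import defaultdict

class SpeculativeInvalidator:
    def __init__(self):
        self.lines_by_pc = defaultdict(list)   # PC -> lines it accessed
        self.pc_by_line = {}                   # line -> last accessing PC

    def on_access(self, pc, line):
        # Link each cached line to the instruction that accessed it.
        if self.pc_by_line.get(line) != pc:
            self.lines_by_pc[pc].append(line)
            self.pc_by_line[line] = pc

    def on_invalidate(self, line):
        # When one line on an instruction's list is invalidated, the
        # other lines on that list become candidates for speculative
        # invalidation (or updating, under corresponding conditions).
        pc = self.pc_by_line.pop(line, None)
        if pc is None:
            return []
        candidates = [l for l in self.lines_by_pc[pc] if l != line]
        self.lines_by_pc[pc] = []
        return candidates
```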
Presentation (225 kB PDF)
A Multiprocessor Memory Processor for
Efficient Sharing and Access Coordination
David M. Koppelman,
Workshop on Mixing Logic and DRAM, International
Symposium on Computer Architecture, June 1997.
The growing disparity between instruction issue rates and memory
access speed impacts multiprocessors especially hard under certain
circumstances.
To alleviate the problem a system is described here in which smart
memory chips can execute simple operations so that certain tasks can
be completed with less contention, fewer messages, or by avoiding
synchronization that would otherwise be necessary.
These operations, issued in only a few cycles by a CPU, direct the
memory to read, modify, and write a memory location. Tag values can be
used to delay completion of an operation until the needed tag value is
set. Operations can have multiple steps, and can be sent between
memories to complete. A completion counter at the issuing CPU can be
used to wait until in-progress operations have completed.
Demonstrating their usefulness, execution-driven simulation of such
systems shows speedup of well over two times on code fragments chosen for
their suitability. The radix sorting program from the SPLASH-2
benchmark shows over 29% reduction in execution time.
Presentation (319 kB PDF)
Sticky States in Banyan Network Queues and Their Application to
Analysis
David M. Koppelman,
The Journal of Parallel and Distributed Computing,
vol. 40, pp. 147-161, 1997.
A finite-buffered banyan network
analysis technique designed to model networks at high traffic loads is
presented. The technique specially models two states of the network
queues: queue empty and queue congested (roughly, zero or one slots free).
A congested queue tends to stay congested because packets
bound for the queue accumulate in the previous stage. The expected
duration of this state is computed using a probabilistic model of a
switching module feeding the congested queue. A technique for finding
a lower arrival rate to an empty queue is also described. The queues
themselves are modeled using independent Markov chains with an
additional congested state added. The new analysis technique is novel
in its modeling of the higher arrival rate to a congested queue and the
lower arrival rate to an empty queue. Comparisons of queue state
distributions obtained with the analysis and with simulation show that
an important feature of congestion is modeled.
A Family of Networks for Non-Uniform Traffic
David M. Koppelman,
IEEE Transactions on Parallel and Distributed Systems
vol. 7, no. 5, pp. 486-492, May 1996.
New networks, called GLO networks, are constructed by adding
bus-like links to omega networks, providing additional
capacity between cells on momentarily busy paths. Equivalent
pin-count GLO and omega networks offered uniform and nonuniform
traffic were simulated. GLO networks exhibited lower latency for
nonuniform traffic and light to moderate uniform traffic.
A Lower Bound On the Average Physical Length
of Edges in the Physical Realization of Graphs
David M. Koppelman,
Parallel Processing Letters, vol. 6, no. 1, pp. 137-143, 1996.
[Summary of introduction.]
A graph realization is an assignment of positional and other
information to the elements of a graph; it can be used to bound the
performance of an interconnection network based on the graph. Vitányi
describes a graph realization in which processors have bounded volume;
a lower bound on the average physical edge length is derived for such
graph realizations. The bound is based on the graph's diameter,
symmetry, and path distribution. The bound is easily applied to many
popular graphs such as trees, d-dimensional toroidal meshes, and
cube-connected cycles. The bounds are not easy to apply to
less-symmetric graphs, in particular the de Bruijn and
shuffle-exchange graphs. A modified version of Vitányi's bound, easily
applied to the de Bruijn and other graphs, is presented. Also
presented is a more precise realization definition.
Congested Banyan Network Analysis Using Congested-Queue States
and Neighboring-Queue Effects
David M. Koppelman,
IEEE/ACM Transactions on Networking,
vol. 4, no. 1, pp. 106-111, February 1996.
A banyan network analysis technique is
presented which more accurately models a congested network than other
reported techniques. The analysis is based on the observation that a
full queue (within the switching modules making up the network) causes
traffic to back up. For a short time after becoming full, the free
space in the queue is limited to no more than one slot. A queue in
such a condition is called congested. Because of blocking, the
arrival rate to a congested queue is higher; this tends to maintain
congestion. The arrival rate to a congested queue's dual is lower, as
is the service rate for queues feeding the congested queue.
effects are captured in the analysis. The state model used for a
queue encodes congested as well as normal operation. Further, the
model codes whether the connected next-stage and dual queues are
congested. Network throughput computed with the model is closer to
that obtained from simulations than other banyan analyses appearing in
the literature, including those designed to model congestion.
Further, the queue-occupancy distributions are much closer than other
analyses, suggesting that the analysis better models conditions in
congested banyan networks.
Related technical report
Congested Banyan Network Analysis Using Congested-Queue States and
Neighboring-Queue Effects, David M. Koppelman,
Technical Report, Louisiana State
University, Baton Rouge, LA, September 1995
Manuscript (275 kB PDF)
An Analysis of Banyan Networks Offered Traffic
with Geometrically Distributed Message Lengths
I-Pyen Lyaw and David M. Koppelman,
IEE Proceedings - Communications, vol. 142, no. 5, pp. 285-291,
October 1995.
An analysis of finite-input-buffered banyan networks offered traffic
having geometrically distributed message lengths is presented. This
is one of the few multistage-network analyses for networks offered
non-unit-length messages and is the only one that the authors are
aware of for finite-input-buffered banyan networks. In the analysis,
network switching elements are modeled using two state-machines, one
for queue heads (HOLs), the other for entire queues. A network is
modeled using one switching-element model to represent each stage.
Together these model the effect that non-unit-length messages have on
banyans. Solutions are obtained iteratively. Network performance
figures were obtained with this analysis and compared to simulation
results. The figures show that the analysis can predict the effect of
message length on throughput and delay, including the performance
degradation caused by longer messages.
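The traffic model above can be made concrete with a short sketch. The
termination probability below is invented for illustration; the point
is only that a geometrically distributed message length with per-flit
termination probability q has mean 1/q, so a smaller q means longer
messages that occupy queue heads for longer.

```python
# Illustrative sketch (parameter invented): drawing geometrically
# distributed message lengths, as offered to the network in the
# analysis.  A message has length k >= 1 with probability
# (1-q)**(k-1) * q, giving a mean length of 1/q.

import random

random.seed(42)
q = 0.25  # per-flit termination probability -> mean length 1/q = 4

def message_length():
    """Draw a geometric length: count flits until the message ends."""
    k = 1
    while random.random() > q:
        k += 1
    return k

lengths = [message_length() for _ in range(100_000)]
print(sum(lengths) / len(lengths))  # sample mean, close to 1/q = 4
```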
Reducing PE/Memory Traffic in Shared Memory Multiprocessors by the
Difference Coding of Memory Addresses
David M. Koppelman,
IEEE Transactions on Parallel and Distributed
Systems, vol. 5, no. 11, pp. 1156-1168, November 1994.
A method of reducing the volume of data flowing through the network in
a shared memory parallel computer (multiprocessor) is described. The
reduction is achieved by difference coding the memory addresses in
messages sent between processing elements (PEs) and memories. In an
implementation each PE would store the last address sent to each
memory and vice versa. Messages that would normally contain an
address instead contain the difference between the addresses
associated with the current and most recent messages.
Trace-driven simulation shows that 70% or less of the traffic volume
(including data and overhead) is needed, even in systems using
coherent caches. The reduction in traffic could result in a lower
cost or lower latency network. The cost of the hardware to achieve
this is small, and the delay added is insignificant compared to
network latency.
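The coding scheme described above can be sketched in a few lines. This
is an illustrative software model, not the paper's hardware design:
each side of a PE/memory pair keeps the last address exchanged, and a
message carries only the difference from that address.

```python
# Minimal sketch of difference coding of memory addresses.  A matched
# coder/decoder pair starts from the same initial state, so the decoder
# can reconstruct each address from the transmitted delta.

class DifferenceCoder:
    """Keeps the last address seen and codes/decodes address deltas."""
    def __init__(self):
        self.last = 0

    def encode(self, addr):
        delta = addr - self.last
        self.last = addr
        return delta

    def decode(self, delta):
        self.last += delta
        return self.last

pe_side = DifferenceCoder()   # coder at the processing element
mem_side = DifferenceCoder()  # matching decoder at the memory

# Sequential accesses yield small deltas, which need fewer bits to
# transmit than full addresses.
for addr in [0x1000, 0x1004, 0x1008, 0x2000]:
    delta = pe_side.encode(addr)
    assert mem_side.decode(delta) == addr
```

In hardware, the win comes from encoding the (usually small) delta in
fewer bits than a full address; the sketch only shows the state kept at
each end.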
Reducing PE/Memory Traffic in Shared
Memory Multiprocessors by the Difference Coding of Memory Addresses
David M. Koppelman,
in International Conference on Parallel Processing, August
1992, vol. I, pp. 53-56
A Self-Routing Permutation Network
David M. Koppelman and A. Yavuz Oruç,
Journal of Parallel and Distributed Computing,
vol. 10, no. 1, pp. 140-151, 1990.
A self-routing permutation network is a connector which can set its
own switches to realize any one-to-one mapping of its inputs onto its
outputs. Many permutation networks have been reported in the
literature, but none with the self-routing property, except crossbars
and cellular permutation arrays which have excessive cost. This paper
describes a self-routing permutation network which has O(log³ n)
bit-level delay and uses O(n log³ n) bit-level hardware, where n is
the number of inputs to the network. The network is derived from a
complementary Benes network by replacing each of its two switches in
its first stage by what is called a 1-sorter, and recursively defining
the switches in the third stage as self-routing networks. The use of
1-sorters results in a substantial reduction in both propagation delay
and hardware cost when contrasted with the O(n) delay and O(n^1.59)
hardware of the recursively decomposed version of a complementary
Benes network. Furthermore, these complexities match the propagation
delay and hardware cost of Batcher's sorters (the only networks, other
than crossbars and cellular permutation arrays, that are known to
behave like self-routing permutation networks). More specifically, it
is shown that the network of this paper uses about half the hardware
with about four-thirds the delay of a Batcher's sorter.
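The asymptotic comparison above can be made tangible with a
back-of-envelope calculation. Constants are omitted, so this is purely
illustrative of growth rates, not of actual circuit delays.

```python
# Rough comparison of the growth terms: the recursively decomposed
# complementary Benes network has O(n) bit-level delay, while the
# 1-sorter based network has O(log^3 n).  Ignoring constants, the
# log^3 n term falls far below n as the network grows.

import math

for n in [64, 1024, 65536]:
    log3 = math.log2(n) ** 3
    print(f"n={n:6d}  O(n) term: {n:6d}   O(log^3 n) term: {log3:7.0f}")
```

For small n the cubed-logarithm term can exceed n, but by n = 65536 it
is already an order of magnitude smaller, which is where the asymptotic
advantage shows.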
A Self-Routing Permutation Network
David M. Koppelman and A. Yavuz Oruç,
in International Conference on Parallel Processing,
August 1989, vol. I, pp. 288-295