GPU Road Network Graph Contraction and SSSP Query
Roozbeh Karimi, David M. Koppelman, and Chris J. Michael,
Proceedings of the ACM International Conference on Supercomputing, June 2019.
PHAST is to date one of the fastest algorithms for performing
single source shortest path (SSSP) queries on road-network
graphs. PHAST operates on graphs produced in part using Geisberger's
contraction hierarchy (CH) algorithm. Producing these graphs is
time consuming, limiting PHAST's usefulness when graphs are not
available in advance. CH iteratively assigns scores to nodes,
contracts (removes) the highest-scoring node, and adds shortcut edges
to preserve distances. Iteration stops when only one node
remains. Scoring and contraction rely on a
witness path search
(WPS) of nearby nodes. Little work has been reported on parallel and
especially GPU CH algorithms. This is perhaps due to issues such as
the validity of simultaneous potentially overlapping searches, score
staleness, and parallel graph updates.
A GPU contraction algorithm, CUCH, is presented which
overcomes these difficulties by partitioning the graph into levels
composed of independent sets of nodes (non-adjacent nodes) with
similar scores. This allows contracting multiple nodes simultaneously
with little coordination between threads. A GPU-efficient WPS is
presented in which a small neighborhood is kept in shared memory and a
hash table is used to detect path overlap. Low-parallelism regions of
contraction and query are avoided by halting contraction early and
computing all-pairs shortest paths (APSP) on the remaining graph. A PHAST-like query computes
SSSP using this contracted graph. Contraction of some DIMACS road
network graphs on an Nvidia P100 GPU achieves a speedup of 20 to
37 over Geisberger's serial code on a Xeon E5-2640 v4. Query times
on CUCH- and CH-contracted graphs were comparable.
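The score/contract/shortcut loop described above can be illustrated with a minimal serial sketch (not CUCH's GPU formulation). The edge-difference score and the witness_path_exists helper below are simplified, hypothetical stand-ins for Geisberger-style node scoring and the witness path search (WPS):

```python
import heapq

def witness_path_exists(graph, u, w, limit, excluded):
    # Dijkstra search from u toward w, skipping the node being
    # contracted; a path of length <= limit is a "witness" that
    # makes the shortcut u-w unnecessary.
    dist = {u: 0}
    pq = [(0, u)]
    while pq:
        d, x = heapq.heappop(pq)
        if d > limit:
            return False
        if x == w:
            return True
        if d > dist.get(x, float('inf')):
            continue
        for y, wt in graph[x].items():
            if y == excluded:
                continue
            nd = d + wt
            if nd <= limit and nd < dist.get(y, float('inf')):
                dist[y] = nd
                heapq.heappush(pq, (nd, y))
    return False

def score(graph, v):
    # Simplified score: shortcuts that contracting v would add,
    # minus the edges removed (the "edge difference" heuristic).
    shortcuts = 0
    nbrs = list(graph[v].items())
    for i, (u, du) in enumerate(nbrs):
        for w, dw in nbrs[i + 1:]:
            if not witness_path_exists(graph, u, w, du + dw, v):
                shortcuts += 1
    return shortcuts - len(nbrs)

def contract_all(graph):
    # graph: node -> {neighbor: weight}, undirected. Iterate until one
    # node remains, contracting the highest-scoring node and adding
    # shortcuts that preserve shortest-path distances.
    order = []
    while len(graph) > 1:
        v = max(graph, key=lambda n: score(graph, n))
        nbrs = list(graph[v].items())
        for i, (u, du) in enumerate(nbrs):
            for w, dw in nbrs[i + 1:]:
                if not witness_path_exists(graph, u, w, du + dw, v):
                    d = min(graph[u].get(w, float('inf')), du + dw)
                    graph[u][w] = d
                    graph[w][u] = d
        for u, _ in nbrs:
            del graph[u][v]
        del graph[v]
        order.append(v)
    return order
```

CUCH's contribution is parallelizing this loop: nodes are grouped into levels of non-adjacent, similarly scored nodes so that a whole level can be contracted at once.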
Thoroughly Exploring GPU Buffering Options for Stencil Code by Using
An Efficiency Measure and a Performance Model
Yue Hu, David M. Koppelman, and Steven R. Brandt,
IEEE Transactions on Multi-Scale Computing Systems,
vol. 4, no. 3, pp. 477-490, July-September 2018.
Stencil computations form the basis for computer simulations across
almost every field of science, such as computational fluid dynamics,
data mining, and image processing. Their mostly regular data access
patterns potentially enable them to take advantage of the high
computation and data bandwidth of GPUs, but only if data buffering and
other issues are handled properly.
Finding a good code generation strategy presents a number of
challenges, one of which is the best way to make use of memory. GPUs
have several types of on-chip storage including registers, shared
memory, and a read-only cache. The choice of type of storage and how
it's used, a buffering strategy, for each stencil array
(grid function, GF) requires a good understanding not only of
its stencil pattern but also of the efficiency of each type of storage
for the GF, to avoid squandering storage that would be more beneficial
to another GF. For a stencil computation with N GFs, the total
number of possible assignments is β^N, where β is the
number of buffering strategies. Our code-generation framework supports
five buffering strategies (β=5). Large, complex stencil kernels
may consist of dozens of GFs, resulting in significant search
overhead.
In this work, we present an analytic performance model for stencil
computations on GPUs and study the behavior of read-only cache and L2
cache. Next, we propose an efficiency-based assignment algorithm which
operates by scoring a change in buffering strategy for a GF using a
combination of (a) the predicted execution time and (b) on-chip
storage usage. By using this scoring, an assignment for N GFs can be
determined in (β-1)N(N+1)/2 steps. Results show that the
performance model has good accuracy and that the assignment strategy
is highly efficient.
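The (β-1)N(N+1)/2 step count follows from a greedy loop: each round scores the β-1 alternative strategies for every still-undecided GF, commits the best-scoring change (if it helps), and fixes that GF. A minimal sketch, with score_change as a hypothetical stand-in for the model's combined time/storage scoring:

```python
def assign_strategies(gfs, strategies, score_change):
    # strategies[0] is the baseline; score_change(gf, s, assignment)
    # returns the efficiency of switching gf to s (higher is better).
    assignment = {gf: strategies[0] for gf in gfs}
    undecided = list(gfs)
    steps = 0
    while undecided:
        best_gf, best_s, best_score = None, None, float('-inf')
        for gf in undecided:
            for s in strategies[1:]:          # beta - 1 alternatives
                steps += 1
                sc = score_change(gf, s, assignment)
                if sc > best_score:
                    best_gf, best_s, best_score = gf, s, sc
        if best_score > 0:                    # apply only improvements
            assignment[best_gf] = best_s
        undecided.remove(best_gf)             # this GF is now fixed
    return assignment, steps                  # steps = (beta-1)N(N+1)/2
```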
A Performance Model and Efficiency-Based Assignment of
Buffering Strategies for Automatic GPU Stencil Code Generation
Yue Hu, David M. Koppelman, Steven R. Brandt,
2016 IEEE 10th International Symposium on
Embedded Multicore/Many-core Systems-on-Chip (MCSOC), pp. 361-368, 2016.
Stencil computations form the basis for computer simulations across
almost every field of science, such as computational fluid dynamics,
data mining, and image processing. Their mostly regular data access
patterns potentially enable them to take advantage of the high
computation and data bandwidth of GPUs, but only if data buffering and
other issues are handled properly.
Finding a good code generation strategy presents a number of challenges, one of
which is the best way to make use of memory. GPUs have three types of
on-chip storage: registers, shared memory, and read-only cache. The
choice of type of storage and how it's used, a buffering
strategy, for each stencil array (grid function, GF) requires
a good understanding not only of its stencil pattern but also of
the efficiency of each type of storage for the GF, to avoid
squandering storage that would be more beneficial to another GF.
Our code-generation framework supports five buffering strategies. For
a stencil computation with N GFs, the total number of possible
assignments is 5^N. Large, complex stencil kernels may consist of
dozens of GFs, resulting in significant search overhead.
In this work, we present an analytic performance model for stencil
computations on GPUs, and study the behavior of read-only cache and L2
cache. Next, we propose an efficiency-based assignment algorithm,
which operates by scoring a change in buffering strategy for a GF
using a combination of (a) the predicted execution time and (b)
on-chip storage usage. By using this scoring, an assignment for N GFs
and b strategy types can be determined in (b-1)N(N+1)/2 steps.
Results show that the performance model has good accuracy and that the
assignment strategy is highly efficient.
GeauxDock: a Novel Approach for Mixed-Resolution Ligand Docking
Using a Descriptor-Based Force Field
Yun Ding, Ye Fang, Wei P. Feinstein, Jagannathan Ramanujam,
David M. Koppelman, Juana Moreno, Michal Brylinski, and Mark Jarrell,
Journal of Computational Chemistry,
vol. 36, no. 27, pp. 2013-2026, October 15, 2015.
Chemora: a PDE Solving Framework for Modern HPC Architectures
Erik Schnetter, Marek Blazewicz, Steven R. Brandt, David M. Koppelman, and Frank Löffler,
Computing in Science and Engineering, vol. 17, no. 2, pp. 53-64, March/April 2015.
Modern HPC architectures consist of heterogeneous multi-core,
many-node systems with deep memory hierarchies. Modern applications
employ ever more advanced discretisation methods to study
multi-physics problems. Developing such applications that explore
cutting-edge physics on cutting-edge HPC systems has become a
complex task that requires significant HPC knowledge and experience.
Unfortunately, this combined knowledge is currently out of reach for all
but a few groups of application developers.
Chemora is a framework for solving systems of Partial Differential
Equations (PDEs) that targets modern HPC architectures. Chemora is
based on Cactus, which sees prominent usage in the computational
relativistic astrophysics community. In Chemora, PDEs are expressed
either in a high-level LaTeX-like language or in Mathematica.
Discretisation stencils are defined separately from equations, and
can include Finite Differences, Discontinuous Galerkin Finite
Elements (DGFE), Adaptive Mesh Refinement (AMR), and multi-block
systems.
We use Chemora in the Einstein Toolkit to implement the Einstein
Equations on CPUs and on accelerators, and study astrophysical
systems such as black hole binaries, neutron stars, and
core-collapse supernovae.
Discovering Barriers to Efficient Execution,
Both Obvious and Subtle,
Using Instruction-Level Visualization
David M. Koppelman and Chris J. Michael,
First
Workshop on Visual Performance Analysis,
New Orleans, Louisiana, USA, November 2014.
CPU performance is determined by the interaction between available
resources, microarchitectural features, the execution of instructions,
and by the data. These elements can interact in complex ways, making
it difficult for those seeing only aggregate performance numbers, such
as miss ratios and issue rates, to determine whether there are
reasonable avenues for performance improvement.
A technique called instruction-level visualization helps users connect
these disparate elements by showing the timing of the execution of
individual program instructions.
The PSE visualization program enhances instruction-level
visualization by showing which instructions contribute to execution
inefficiency in a way that makes it easy to locate dependent
instructions and the history of events affecting the instruction. A
simple annotation system makes it easy for a user to attach custom
information. PSE has been used for microarchitecture research,
simulator debugging, and for instructional use.
From Physics Model to Results: An Optimizing Framework for
Cross-Architecture Code Generation
Marek Blazewicz, Ian Hinder, David M. Koppelman, Steven R. Brandt, Milosz Ciznick, Michal Kierzynka, Frank Löffler, Erik Schnetter, and Jian Tao,
Scientific Programming, vol. 21, pp. 1-16, 2013.
Starting from a high-level problem description in terms of
partial differential equations using abstract tensor notation, the
Chemora framework discretizes, optimizes, and generates complete high
performance codes for a wide range of compute architectures. Chemora
extends the capabilities of Cactus, facilitating the usage of
large-scale CPU/GPU systems in an efficient manner for complex
applications, without low-level code tuning. Chemora achieves
parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient
loop traversal strategies, dynamically selected data and instruction
cache usage strategies, and JIT compilation of GPU code tailored to
the problem characteristics. The discretization is based on
higher-order finite differences on multi-block domains. Chemora's
capabilities are demonstrated by simulations of black hole
collisions. This problem provides an acid test of the framework, as
the Einstein equations contain hundreds of variables and thousands of
terms.
The Interaction and Relative Effectiveness of Hardware and Software
Data Prefetch
S. Verma and David M. Koppelman,
Journal of Circuits, Systems, and Computers
(special issue on Interaction between Compilers and
Computer Architectures), vol. 21, no. 2, pp. 1-21, 2012.
A major performance limiter in modern processors is the long latencies
caused by data cache misses. Both compiler and hardware based
prefetching schemes help hide these latencies and so improve
performance. Compiler techniques infer memory access patterns through
code analysis, and insert appropriate prefetch instructions. Hardware
prefetching techniques work independently from the compiler by
monitoring an access stream, detecting patterns in this stream and
issuing prefetches based on these patterns. This paper looks at the
interplay between compiler-based and hardware-based prefetching
techniques. Does either technique make the other unnecessary?
First, compilers' ability to achieve good results without extreme
expertise is evaluated by preparing binaries with no prefetch,
one-flag prefetch (no tuning), and expertly tuned prefetch. From runs
of SPECcpu2006 binaries, we find that expertise avoids minor slowdown
in a few benchmarks and provides substantial speedup in several.
We compare software schemes to hardware prefetching schemes and our
simulations show software alone substantially outperforms hardware
alone on about half of a selection of benchmarks. In other cases
hardware matches or exceeds software. Analysis reveals that in many
cases hardware is not prefetching access patterns that it is capable
of recognizing, due to some irregularities in the observed miss
sequence. Hardware outperforms software on address sequences that the
compiler would not guess. Using the two schemes together provides
further benefits, but less than the sum of the contributions of each
alone.
A Penalty-Sensitive Branch Predictor
Yue Hu, David M. Koppelman, Lu Peng,
The Journal of Instruction-Level Parallelism, 2nd
JILP Workshop on Computer Architecture Competitions (JWAC-2):
Championship Branch Prediction, held in conjunction with The 38th
International Symposium on Computer Architecture, June 2011.
Branch predictor design typically focuses only on
minimizing the misprediction rate (MR), while ignoring the
misprediction penalty. Because the misprediction penalty
varies widely from branch to branch, performance might
be improved by using a predictor that makes a greater
effort to predict high-penalty branches, at the expense of
the others, even if the total number of mispredictions
doesn't change.
A penalty-sensitive predictor was developed based
on this idea. It includes a penalty predictor to predict
whether a branch is high or low penalty. Then, a two-class
TAGE predictor is developed to favor high-penalty
branches at the expense of low-penalty branches.
Experiments show that although the overall performance
improvement is limited, the penalty-sensitive
mechanism successfully decreases the MR of the high-penalty
branches while increasing the MR of the low-penalty
branches by only a small amount.
Presentation (680 kB Powerpoint)
Efficient Prefetching with Hybrid Schemes and Use of Program
Feedback to Adjust Prefetcher Aggressiveness
Santhosh Verma, David M. Koppelman, and Lu Peng,
The Journal of Instruction-Level Parallelism, vol. 13, pp. 1-14, January 2011.
A set of hybrid and adaptive prefetching schemes are considered in this
paper. The prefetchers are hybrid in that they use combinations of
Stride, Sequential and C/DC prefetching mechanisms. The schemes are
adaptive in that their aggressiveness is adjusted based on feedback
metrics collected dynamically during program execution. Metrics such
as prefetching accuracy and prefetching timeliness are used to vary
aggressiveness in terms of prefetch distance (how far ahead of the
current miss it fetches) and prefetch degree (the number of prefetches
issued). The scheme is applied separately at both the L1 and L2 cache
levels. We previously proposed a Hybrid Adaptive prefetcher for the
Data Prefetching Competition (DPC-1) which uses a hybrid PC-based
Stride/Sequential prefetcher. In this work, we also consider hybrid
schemes that use the CZone with Delta Correlations (C/DC)
prefetcher. Further, we break down the components of hybrid adaptive
prefetching by evaluating the individual benefits of using hybrid
schemes and using adaptivity. While hybrid prefetching improves
performance over a single prefetcher, we demonstrate that the main
benefit of adaptivity is the reduction in the number of prefetches
without hurting performance. For a set of 18 SPEC CPU 2006 benchmarks,
the best hybrid adaptive scheme achieves an average (geometric mean)
CPI improvement of 35% compared to no prefetching and 19% compared to
a basic Stride prefetcher. Adaptivity makes it possible to reduce the
number of prefetches issued by as much as 50% in some benchmarks
without sacrificing performance.
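The adaptivity described above can be sketched as interval-based feedback: accuracy (useful/issued) and timeliness counters steer the prefetch distance and degree. The thresholds and limits below are illustrative assumptions, not the paper's tuned values:

```python
class AdaptivePrefetcher:
    def __init__(self):
        self.distance, self.degree = 4, 2       # current aggressiveness
        self.issued = self.useful = self.late = 0

    def record(self, was_useful, was_late=False):
        # Called per prefetch once its usefulness/timeliness is known.
        self.issued += 1
        if was_useful:
            self.useful += 1
            if was_late:
                self.late += 1

    def adjust(self):
        # Called at the end of each sampling interval.
        if self.issued == 0:
            return
        accuracy = self.useful / self.issued
        lateness = self.late / max(self.useful, 1)
        if accuracy > 0.75 and lateness > 0.5:
            # Accurate but late: fetch further ahead, issue more.
            self.distance = min(self.distance * 2, 64)
            self.degree = min(self.degree + 1, 8)
        elif accuracy < 0.40:
            # Inaccurate: back off to cut useless prefetches.
            self.distance = max(self.distance // 2, 1)
            self.degree = max(self.degree - 1, 1)
        self.issued = self.useful = self.late = 0
```

The same feedback loop can be instantiated separately at L1 and L2, as in the paper.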
The Effects On Branch Prediction When Utilizing Control Independence
Chris J. Michael and David M. Koppelman,
The IEEE International Parallel & Distributed Processing Symposium PhD Forum,
April 2010.
Though current general-purpose processors have several small CPU cores
as opposed to a single more complex core, many algorithms and
applications are inherently sequential and so hard to explicitly
parallelize. Cores designed to handle these problems may exhibit
deeper pipelines and wider fetch widths to exploit instruction-level
parallelism via out-of-order execution. As these parameters increase,
so does the number of instructions fetched along an incorrect path
when a branch is mispredicted. Some instructions are fetched
regardless of the direction of a branch. In current conventional CPUs,
these instructions are always squashed upon branch misprediction and
are fetched again shortly thereafter. Recent research efforts explore
lessening the effect of branch mispredictions by retaining these
instructions when squashing or fetching them in advance when
encountering a branch that is difficult to predict. Though these
control independent processors are meant to lessen the damage of
misprediction, an inherent side-effect of fetching out of order,
branch weakening, reduces realized speedup and is in part responsible
for lowering potential speedup. This study formally defines and works
towards identifying the causes of branch weakening. The overall goal
of the research is to determine how much weakening is avoidable and
develop techniques to help reduce weakening in control independent
processors.
Spotlight--a Low Complexity Highly Accurate Profile-Based Branch Predictor
Santhosh Verma, Benjamin Maderazo and David M. Koppelman,
in the proceedings of the 28th IEEE International
Performance Computing and Communications Conference, Phoenix, AZ,
December 2009.
In an effort to achieve the high prediction accuracy needed to attain
high instruction throughputs, branch predictors proposed in the
literature and used in real systems have become increasingly more
complex and larger over time. This is not consistent with the
anticipated trend of simpler and more numerous cores in future
multi-core processors. We introduce the Spotlight Branch predictor, a
novel profile-based predictor which is able to achieve high prediction
accuracy despite its simple design. Spotlight achieves high accuracy
because complex decisions in the prediction process are made during an
OS managed, one-time profile run instead of using complex hardware. We
show that Spotlight achieves higher accuracy than Gshare as well as
highly accurate and implementable predictors such as YAGS and the
Hybrid Bimodal-Gshare predictor. It achieves an average reduction in
misprediction rate of 20% over Gshare, 11% over Elastic History
Buffer, 14% over YAGS, and 10% over Hybrid for a hardware budget of 8
kB. Spotlight is also compared to two difficult to implement neural
predictors, the Path-based Neural and the Hashed Perceptron. It
outperforms the Path-based Neural predictor at all sizes and the
Hashed Perceptron at smaller hardware budgets. These results
demonstrate that a simple profile-based predictor can achieve many of
the benefits of more complex predictors. We also show that a single
cycle latency implementation of Spotlight can be achieved without
sacrificing accuracy by using an upstream replacement scheme.
Memory Hierarchy Performance Measurement of Commercial
Dual-Core Desktop Processors
L. Peng, J-K Peir, T. K. Prakash, C. Staelin, Y-K. Chen, and D. Koppelman,
Journal of Systems Architecture,
vol 54, no. 8, pp. 816-828, Aug. 2008.
As the chip multiprocessor (CMP) has become the mainstream in processor
architectures, Intel and AMD have introduced their dual-core
processors. In this paper, performance measurements on an Intel Core 2
Duo, an Intel Pentium D, and an AMD Athlon 64 X2 processor are
reported. According to the design specifications, key deviations
exist in the critical memory hierarchy architecture among these
dual-core processors. In addition to the overall execution time and
throughput measurements using both multi-programmed and multi-threaded
workloads, this paper provides detailed analysis on the memory
hierarchy performance and on the performance scalability between
single and dual cores. Our results indicate that for better
performance and scalability, it is important to have (1) fast
cache-to-cache communication, (2) large L2 or shared capacity, (3)
fast L2-to-core latency, and (4) fair cache resource sharing. The three
dual-core processors that we studied have shown the benefits of some of
these factors, but not all of them. Core 2 Duo has the best
performance for most of the workloads because of its microarchitecture
features such as the shared L2 cache. Pentium D shows the worst
performance in many aspects because it is a technology remap of the
Pentium 4 that does not take advantage of on-chip communication.
The Impact of Fetch Rate and Reorder Buffer Size On
Speculative Pre-Execution
David M. Koppelman,
Workshop on Duplicating,
Deconstructing, and Debunking Held in conjunction with the 30th
International Symposium on Computer Architecture
June 8, 2003, pp. 64-75.
Pre-execution systems reduce the impact of cache misses and branch
mispredictions by forking a slice, a code fragment derived from the
program, in advance of frequently mispredicted branches and frequently
missing loads in order to either resolve the branch or prefetch the
load. Because unnecessary instructions are omitted, the slice reaches
the branch or load before the main thread does; for loads, this time
margin can reduce or even eliminate cache miss delay.
Published results have shown significant improvements for some
benchmarks, on the order of 20%, with many showing at least
single-digit improvements. These studies left unexamined two system
parameters that one would expect pre-execution to be sensitive to:
fetch rate and reorder buffer size. Higher fetch rate would allow the
main thread to reach the troublesome load sooner, but would not affect
the slice and so the slice's margin is reduced. Studies have shown
large potential margins for slices, but the fetch rate effect has not
been measured. A second system parameter is reorder buffer size. A
larger reorder buffer would allow a system to hide more of the miss
latency that pre-execution reduces.
To test the sensitivity to these factors pre-execution schemes were
simulated on systems with varying fetch rates and reorder buffer
sizes. Results show that higher fetch rate does not reduce
pre-execution speedup in most benchmarks. Reorder buffer size
sensitivity varies: some benchmarks are insensitive to reorder buffer
size increases beyond 256 entries but still benefit from
pre-execution, the benefit due in large part to prefetching those
loads that provide values for frequently mispredicted branches. The
benchmarks that are sensitive to reorder buffer size are also the ones
that benefit most from pre-execution.
Presentation (813 kB PDF)
The Benefit of Multiple Branch Prediction On Dynamically Scheduled
Systems
David M. Koppelman,
Workshop on Duplicating,
Deconstructing, and Debunking Held in conjunction with the 29th
International Symposium on Computer Architecture
May 26, 2002, pp. 42-51.
Providing instructions for wide-issue systems is a challenge because
the execution path must be predicted before branches are even decoded.
Since basic blocks can easily be smaller than the issue width of
near-future systems, multiple branch prediction is important.
An early solution to the problem is the multiple branch predictor
(MBP) of Yeh, Marr, and Patt. A PC-indexed branch address cache (BAC)
provides information on all basic blocks within a number, say 3, basic
blocks of the PC; a path through the blocks is chosen by a
global-history branch predictor.
The published evaluation did not include overall performance numbers,
a shortcoming shared by many subsequent studies of similar systems.
To fill this gap dynamically scheduled processors using a MBP or a
conventional mechanism were simulated using RSIM. For comparison
a multiple branch predictor based on superblocks, described by
Reinman, Austin, and Calder was also simulated.
Compared to single branch prediction, 8-way superscalar MBP systems
show a speedup of 10% on some SPECcpu integer benchmarks on a
resource-constrained system; the superblock predictor realizes an
improvement of 8%. MBP tree depths of 2 and 3 show nearly equal
performance. As one would expect, MBP systems fill more decode slots,
but data show that much of the performance gain is due to providing
more time for dependent integer instructions. Much smaller are gains
due to early initiation of memory operations (except when using fast
caches) and a less frequently empty reorder buffer.
Presentation (455 kB PDF)
Neighborhood Prefetching On Multiprocessors Using Instruction History
David M. Koppelman,
International
Conference on Parallel Architectures and
Compilation Techniques,
October 2000, pp. 123-132.
A multiprocessor prefetch scheme is described in which a miss is
followed by a prefetch of a group of lines, a neighborhood,
surrounding the demand-fetched line. The neighborhood is based on the
data address and the past behavior of the access instruction that
missed the cache. A neighborhood for an instruction that misses the
cache is constructed by recording the offsets of addresses that
subsequently miss. This neighborhood prefetching can exploit
sequential access, as sequential prefetch can, and can to some extent
exploit stride access, as stride prefetch can. Unlike stride and
sequential prefetch, it can support irregular access patterns, for
example, accessing some subset of a structure's members.
Neighborhood prefetching was compared to adaptive sequential
prefetching using execution-driven simulation. Results show more
useful prefetches, and lower execution time, for neighborhood
prefetching for six of eight SPLASH-2 benchmarks. On eight SPLASH-2
benchmarks the average normalized execution time is less than 0.9,
for three benchmarks, less than 0.8.
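The neighborhood mechanism described above can be sketched as follows; table sizes, the learning window, and the Python dictionaries are illustrative stand-ins for the hardware's instruction-history tables:

```python
from collections import defaultdict, deque

class NeighborhoodPrefetcher:
    def __init__(self, window=4, max_offsets=8):
        self.neighborhoods = defaultdict(set)   # PC -> set of line offsets
        self.recent = deque(maxlen=window)      # recent (PC, line) misses
        self.max_offsets = max_offsets

    def on_miss(self, pc, line):
        # Learn: record this miss as an offset from recent misses, so
        # each recent instruction's neighborhood grows to cover it.
        for prev_pc, prev_line in self.recent:
            hood = self.neighborhoods[prev_pc]
            if len(hood) < self.max_offsets:
                hood.add(line - prev_line)
        self.recent.append((pc, line))
        # Predict: prefetch the learned neighborhood around the
        # demand-fetched line, based on this PC's past behavior.
        return sorted(line + off
                      for off in self.neighborhoods[pc] if off != 0)
```

Because offsets need not be constant or consecutive, the same table covers sequential, strided, and irregular patterns such as struct-member accesses.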
Longer Version (211 kB PDF)
Presentation (272 kB PDF)
Using Predictions of Non-Regular Memory Access Patterns
to Improve Multiprocessor Cache Performance (Neighborhood Prefetching)
David M. Koppelman,
LSU ECE Department Seminar, March 1999.
Computer engineers are suffering in a hell partly of their own creation:
while the time needed to access memory has slowly dropped, the number
of instructions that can be executed during that time has increased
exponentially. Since execution must sometimes wait for the outcome of
a memory access, efforts at improving instruction execution speed are
increasingly futile. This memory access latency can be partly avoided
by caching data in a small, expensive, high-speed memory.
Memory latency is still a problem because needed data is not always in
the cache.
An aggressive solution to the problem is prefetching: bringing data
into the cache before it is needed using predictions of memory access
addresses. Prefetch schemes reported in the literature predict
consecutive or arithmetic address sequences, for example, 1000, 1025,
1050, 1075, etc. A more flexible scheme is described here: a
neighborhood is found for memory access instructions based on their
past behavior. Address predictions are made using a neighborhood; these can form
sequential or stride sequences, but can also form irregular sequences,
increasing the number of useful prefetches and reducing the number of
useless ones.
This neighborhood prefetching was evaluated by execution-driven
simulation of shared-memory parallel computers (multiprocessors)
running SPLASH 2 benchmark programs. Performance improvements
of up to 25% were obtained on neighborhood prefetching schemes
having storage requirements of only 7% that of the cache itself.
Speculative Multiprocessor Cache Line Actions Using Instruction and Line History
David M. Koppelman,
The
10th International Conference on Parallel & Distributed Computing Systems,
New Orleans, Louisiana, October 1997, pp. 401-406.
A technique is described for reducing miss latency in coherent-cache
shared-memory parallel computers. Miss latency is reduced by
speculatively invalidating and updating (copying an exclusively held
line back to memory) cache lines at one processor that might be needed
at another processor.
A line becomes a candidate for speculative invalidation when another
line last accessed by the same instruction is invalidated. A line
becomes a candidate for speculative updating under corresponding
conditions.
The technique can be implemented by constructing linked lists of lines
for each memory access instruction. The amount of memory needed by an
implementation is less than 50% of the size of the cache, roughly
comparable to the cost of adding error-correcting memory. No time
need be added to cache hits. In execution-driven simulations of such
systems running programs from the SPLASH-2 suite, invalidations and
updates are reduced by 50% or more. Total time spent servicing
misses is reduced by about 20% on 16-processor systems; execution
time is reduced by as much as 20% for some benchmarks on high-latency
systems.
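The per-instruction line lists can be illustrated behaviorally as below; the structure names are hypothetical and Python containers stand in for the hardware's linked lists:

```python
from collections import defaultdict

class SpeculativeInvalidator:
    def __init__(self):
        self.lines_by_pc = defaultdict(list)   # PC -> lines it accessed
        self.pc_by_line = {}                   # line -> last accessing PC

    def on_access(self, pc, line):
        # Link each cached line to the instruction that accessed it.
        if self.pc_by_line.get(line) != pc:
            self.lines_by_pc[pc].append(line)
            self.pc_by_line[line] = pc

    def on_invalidate(self, line):
        # When one line on an instruction's list is invalidated, the
        # other lines on that list become candidates for speculative
        # invalidation (or updating, under corresponding conditions).
        pc = self.pc_by_line.pop(line, None)
        if pc is None:
            return []
        candidates = [l for l in self.lines_by_pc[pc] if l != line]
        self.lines_by_pc[pc] = []
        return candidates
```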
Presentation (225 kB PDF)
A Multiprocessor Memory Processor for
Efficient Sharing and Access Coordination
David M. Koppelman,
Workshop on Mixing Logic and DRAM, International
Symposium on Computer Architecture, June 1997.
The growing disparity between instruction issue rates and memory
access speed impacts multiprocessors especially hard under certain
circumstances.
To alleviate the problem a system is described here in which smart
memory chips can execute simple operations so that certain tasks can
be completed with less contention, fewer messages, or by avoiding
synchronization that would otherwise be necessary.
These operations, issued in only a few cycles by a CPU, direct the
memory to read, modify, and write a memory location. Tag values can be
used to delay completion of an operation until the needed tag value is
set. Operations can have multiple steps, and can be sent between
memories to complete. A completion counter at the issuing CPU can be
used to wait until in-progress operations have completed.
Demonstrating their usefulness, execution-driven simulation of such
systems shows speedup of well over two times on code fragments chosen for
their suitability. The radix sorting program from the SPLASH-2
benchmark shows over 29% reduction in execution time.
Presentation (319 kB PDF)
Sticky States in Banyan Network Queues and Their Application to
Analysis
David M. Koppelman,
The Journal of Parallel and Distributed Computing,
vol. 40, pp. 147-161, 1997.
A finite-buffered banyan network
analysis technique designed to model networks at high traffic loads is
presented. The technique specially models two states of the network
queues: queue empty and queue congested (roughly, zero or one slots free).
A congested queue tends to stay congested because packets
bound for the queue accumulate in the previous stage. The expected
duration of this state is computed using a probabilistic model of a
switching module feeding the congested queue. A technique for finding
a lower arrival rate to an empty queue is also described. The queues
themselves are modeled using independent Markov chains with an
additional congested state added. The new analysis technique is novel
in its modeling of the higher arrival rate to a congested queue and the
lower arrival rate to an empty queue. Comparisons of queue state
distributions obtained with the analysis and with simulation show that
an important feature of congestion is modeled.
A Family of Networks for Non-Uniform Traffic
David M. Koppelman,
IEEE Transactions on Parallel and Distributed Systems
vol. 7, no. 5, pp. 486-492, May 1996.
New networks, called GLO networks, are constructed by adding
bus-like links to omega networks, providing additional
capacity between cells on momentarily busy paths. Equivalent
pin-count GLO and omega networks offered uniform and nonuniform
traffic were simulated. GLO networks exhibited lower latency for
nonuniform traffic and light to moderate uniform traffic.
A Lower Bound On the Average Physical Length
of Edges in the Physical Realization of Graphs
David M. Koppelman,
Parallel Processing Letters, vol. 6, no. 1, pp. 137-143, 1996.
[Summary of introduction.]
A graph realization is an assignment of positional and other
information to the elements of a graph; it can be used to bound the
performance of an interconnection network based on the graph. Vitányi
describes a graph realization in which processors have bounded volume;
a lower bound on the average physical edge length is derived for such
graph realizations. The bound is based on the graph's diameter,
symmetry, and path distribution. The bound is easily applied to many
popular graphs such as trees, d-dimensional toroidal meshes, and
cube-connected cycles. The bounds are not easy to apply to
less-symmetric graphs, in particular the de Bruijn and
shuffle-exchange graphs. A modified version of Vitányi's bound, easily
applied to the de Bruijn and other graphs, is presented. Also
presented is a more precise realization definition.
Congested Banyan Network Analysis Using Congested-Queue States
and Neighboring-Queue Effects
David M. Koppelman,
IEEE/ACM Transactions on Networking,
vol. 4, no. 1, pp. 106-111, February 1996.
A banyan network analysis technique is
presented which more accurately models a congested network than other
reported techniques. The analysis is based on the observation that a
full queue (within the switching modules making up the network) causes
traffic to back up. For a short time after becoming full, the free
space in the queue is limited to no more than one slot. A queue in
such a condition is called congested. Because of blocking, the
arrival rate to a congested queue is higher; this tends to maintain
congestion. The arrival rate to a congested queue's dual is lower, as
is the service rate for queues feeding the congested queue.
effects are captured in the analysis. The state model used for a
queue encodes congested as well as normal operation. Further, the
model codes whether the connected next-stage and dual queues are
congested. Network throughput computed with the model is closer to
that obtained from simulations than other banyan analyses appearing in
the literature, including those designed to model congestion.
Further, the queue-occupancy distributions are much closer than other
analyses, suggesting that the analysis better models conditions in
congested banyan networks.
Related technical report
Congested Banyan Network Analysis Using Congested-Queue States and
Neighboring-Queue Effects, David M. Koppelman,
Technical Report, Louisiana State
University, Baton Rouge, LA, September 1995
Manuscript (275 kB PDF)
An Analysis of Banyan Networks Offered Traffic
with Geometrically Distributed Message Lengths
I-Pyen Lyaw and David M. Koppelman,
IEE Proceedings - Communications, vol. 142, no. 5, pp. 285-291,
October 1995.
An analysis of finite-input-buffered banyan networks offered traffic
having geometrically distributed message lengths is presented. This
is one of the few multistage-network analyses for networks offered
non-unit-length messages and is the only one that the authors are
aware of for finite-input-buffered banyan networks. In the analysis,
network switching elements are modeled using two state-machines, one
for queue heads (HOLs), the other for entire queues. A network is
modeled using one switching-element model to represent each stage.
Together these model the effect that non-unit-length messages have on
banyans. Solutions are obtained iteratively. Network performance
figures were obtained with this analysis and compared to simulation
results. The figures show that the analysis can predict the effect of
message length on throughput and delay, including the performance
degradation caused by longer messages.
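The traffic model above can be made concrete with a short sketch. The
termination probability below is invented for illustration; the point
is only that a geometrically distributed message length with per-flit
termination probability q has mean 1/q, so a smaller q means longer
messages that occupy queue heads for longer.

```python
# Illustrative sketch (parameter invented): drawing geometrically
# distributed message lengths, as offered to the network in the
# analysis.  A message has length k >= 1 with probability
# (1-q)**(k-1) * q, giving a mean length of 1/q.

import random

random.seed(42)
q = 0.25  # per-flit termination probability -> mean length 1/q = 4

def message_length():
    """Draw a geometric length: count flits until the message ends."""
    k = 1
    while random.random() > q:
        k += 1
    return k

lengths = [message_length() for _ in range(100_000)]
print(sum(lengths) / len(lengths))  # sample mean, close to 1/q = 4
```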
Reducing PE/Memory Traffic in Shared Memory Multiprocessors by the
Difference Coding of Memory Addresses
David M. Koppelman,
IEEE Transactions on Parallel and Distributed
Systems, vol. 5, no. 11, pp. 1156-1168, November 1994.
A method of reducing the volume of data flowing through the network in
a shared memory parallel computer (multiprocessor) is described. The
reduction is achieved by difference coding the memory addresses in
messages sent between processing elements (PEs) and memories. In an
implementation each PE would store the last address sent to each
memory and vice versa. Messages that would normally contain an
address instead contain the difference between the addresses
associated with the current and most recent messages.
Trace-driven simulation shows that 70% or less of the traffic volume
(including data and overhead) is needed, even in systems using
coherent caches. The reduction in traffic could result in a lower
cost or lower latency network. The cost of the hardware to achieve
this is small, and the delay added is insignificant compared to
network latency.
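The coding scheme described above can be sketched in a few lines. This
is an illustrative software model, not the paper's hardware design:
each side of a PE/memory pair keeps the last address exchanged, and a
message carries only the difference from that address.

```python
# Minimal sketch of difference coding of memory addresses.  A matched
# coder/decoder pair starts from the same initial state, so the decoder
# can reconstruct each address from the transmitted delta.

class DifferenceCoder:
    """Keeps the last address seen and codes/decodes address deltas."""
    def __init__(self):
        self.last = 0

    def encode(self, addr):
        delta = addr - self.last
        self.last = addr
        return delta

    def decode(self, delta):
        self.last += delta
        return self.last

pe_side = DifferenceCoder()   # coder at the processing element
mem_side = DifferenceCoder()  # matching decoder at the memory

# Sequential accesses yield small deltas, which need fewer bits to
# transmit than full addresses.
for addr in [0x1000, 0x1004, 0x1008, 0x2000]:
    delta = pe_side.encode(addr)
    assert mem_side.decode(delta) == addr
```

In hardware, the win comes from encoding the (usually small) delta in
fewer bits than a full address; the sketch only shows the state kept at
each end.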
Reducing PE/Memory Traffic in Shared
Memory Multiprocessors by the Difference Coding of Memory Addresses
David M. Koppelman,
in International Conference on Parallel Processing, August
1992, vol. I, pp. 53-56
A Self-Routing Permutation Network
David M. Koppelman and A. Yavuz Oruç,
Journal of Parallel and Distributed Computing,
vol. 10, no. 1, pp. 140-151, 1990.
A self-routing permutation network is a connector which can set its
own switches to realize any one-to-one mapping of its inputs onto its
outputs. Many permutation networks have been reported in the
literature, but none with the self-routing property, except crossbars
and cellular permutation arrays which have excessive cost. This paper
describes a self-routing permutation network which has O(log³ n)
bit-level delay and uses O(n log³ n) bit-level hardware, where n is
the number of inputs to the network. The network is derived from a
complementary Benes network by replacing each of its two switches in
its first stage by what is called a 1-sorter, and recursively defining
the switches in the third stage as self-routing networks. The use of
1-sorters results in a substantial reduction in both propagation delay
and hardware cost when contrasted with the O(n) delay and O(n^1.59)
hardware of the recursively decomposed version of a complementary
Benes network. Furthermore, these complexities match the propagation
delay and hardware cost of Batcher's sorters (the only networks, other
than crossbars and cellular permutation arrays, that are known to
behave like self-routing permutation networks). More specifically, it
is shown that the network of this paper uses about half the hardware
with about four-thirds the delay of a Batcher's sorter.
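The asymptotic comparison above can be made tangible with a
back-of-envelope calculation. Constants are omitted, so this is purely
illustrative of growth rates, not of actual circuit delays.

```python
# Rough comparison of the growth terms: the recursively decomposed
# complementary Benes network has O(n) bit-level delay, while the
# 1-sorter based network has O(log^3 n).  Ignoring constants, the
# log^3 n term falls far below n as the network grows.

import math

for n in [64, 1024, 65536]:
    log3 = math.log2(n) ** 3
    print(f"n={n:6d}  O(n) term: {n:6d}   O(log^3 n) term: {log3:7.0f}")
```

For small n the cubed-logarithm term can exceed n, but by n = 65536 it
is already an order of magnitude smaller, which is where the asymptotic
advantage shows.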
A Self-Routing Permutation Network
David M. Koppelman and A. Yavuz Oruç,
in International Conference on Parallel Processing,
August 1989, vol. I, pp. 288-295