User-Visible Changes to Proteus and Related Programs

10 January 1998, 15:13:11 CST, Proteus Version L3.13

Proteus Version L3.13 includes several bug fixes, plus some new features in stats and minor improvements elsewhere.

For now, the changes are only documented here.

Bug Fixes

Timer Interrupts and Quantum Expiration

Before version L3.13, timer interrupts would only occur on idle processors. This could severely distort the timing of any program that used timer interrupts while a thread was active or that had more than one ready thread per processor. The bug was identified and fixed by Pallavi Ramam, ramam@a.cs.okstate.edu.

Deadlock in Nonblocking Operations

Nonblocking operations could deadlock when switching between threads on a single processor. Fixed in L3.13.

Thread Sleep

Internal errors could result when the target of a thread_sleep call was not the thread making the call. The bug has been fixed, and thread_sleep can now be called for a suspended thread.

Shared Memory Access Functions Used With Host Address

Segmentation faults could occur when shared memory access functions, such as Shared_Memory_Read, were used on non-shared addresses. The errors would not occur with augment-inserted calls, only with calls made in user code or inserted by catoc. The bug has been fixed.

Miss Latency Data in Bus Configurations

A runtime error would result when miss latency statistics collection was enabled (WATCH_MISSES) in bus configurations. The bug has been fixed.

Proteus Changes

Thread Sleeping While Suspended

A suspended thread can now be changed to the sleep state.

Snapshot Improvements

Snapshot now shows additional data. The root procedure for all threads is now shown. Sleeping threads are shown in a separate list along with their wakeup times. (Thanks to Pallavi Ramam, ramam@a.cs.okstate.edu.) The number of active and ready threads given is now correct.

Event Files

RTI values are now included in the event file. The modification time for Proteus files is now included in the event files. (See stats changes.)

New Functions

Two minor functions were added. meta_add(iptr,amount) adds amount to the integer pointed to by iptr and returns the sum. The function is implemented in non-cycle-counted code. It's intended to provide atomicity for instrumentation code.
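The described behavior of meta_add can be sketched in plain C as follows. This is only an illustration of the semantics stated above, not the Proteus source; the int types are an assumption, and the atomicity that Proteus gets from running the function in non-cycle-counted code is not reproduced by ordinary C.

```c
/* Sketch of the documented semantics only; not the Proteus implementation.
   The int parameter and return types are assumed for illustration. */
int meta_add(int *iptr, int amount)
{
    *iptr += amount;   /* add amount to the integer iptr points to */
    return *iptr;      /* return the resulting sum */
}
```

Instrumentation code could, for instance, keep a counter of calls to some routine and bump it with meta_add(&counter, 1).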

Stats Changes

Perdecile and Perhexile Divider Selection Changes

The divider values chosen by selecting "Value Perhexile" and "Sample Perdecile" from the Map menu are now based on the current view. (Previously they were based on the full view.)

RTI and Parameters Available From Stats

RTI (run-time initialized variables) and parameters (including the status of Proteus files) are now available from stats. They are obtained by selecting "Select Parameters" or "Select RTIs" from the Edit menu, which makes them the current selection. They can then be pasted into another program, such as a text editor.

Stats Divider Select and Paste

Dividers used in event graphs can now be selected and pasted. To select, choose "Select Dividers" from the Edit menu. The divider values can then be pasted into another program. If another instance of stats is running showing the same graph but, say, from a different simulation run, then the selected divider values can be pasted by choosing "Set Dividers" from the Edit menu. With identical divider values, the two graphs are easy to compare. Using these menu options, dividers can be selected, pasted into a text editor, changed (perhaps to place dividers at more revealing locations), selected in the text editor, and then pasted back into stats.

Stats Linking

Two or more instances of stats can be linked so that the graphs and divider values displayed by one will also be displayed by the other. With at least one other stats running, choose "Link Stats" from the Map menu of stats. The next most recently accessed instance of stats will now be linked. Graph changes and divider changes in the current one will also occur in the linked one.

Stats Keyboard Commands

Additional commands can be issued from the keyboard. The keys are shown next to the menu items used to issue the commands.

Cumulative Bin Distribution

Stats event graph legends now optionally show, for each bin, the number of values in that and all smaller bins (a cumulative distribution).
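The cumulative figure shown for each bin is a running sum over the bins. A minimal sketch, assuming the bins are stored in order from smallest to largest range (the function and parameter names are hypothetical, not taken from stats):

```c
#include <stddef.h>

/* For each bin, compute the number of values in that bin and all smaller
   bins. Assumes bin_counts[0] is the smallest-valued bin. */
void cumulative_counts(const long *bin_counts, long *cumulative, size_t nbins)
{
    long running = 0;
    for (size_t i = 0; i < nbins; i++) {
        running += bin_counts[i];
        cumulative[i] = running;
    }
}
```

The last entry of the cumulative array always equals the total number of values in the view.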

Frame Title

Stats frame titles now show the event file name and the time of the simulation run.

Compressed Event Files

If an event file ending with ".gz" is specified, it will automatically be uncompressed using gzcat (which must be installed on your system).

19 August 1997, 18:49:03 CDT, Proteus Version L3.12

Proteus Version L3.12 fixes several bugs in L3.11 and adds a few minor features, most notably a miss latency graph.

Bug Fixes

The bugs described below have been fixed.

Shrinking TLB

In L3.11 under certain circumstances the capacity of a TLB would shrink, increasing the miss rate.

VM With Partially Simulated Caches

In version L3.11 VM would not work when CACHEWITHDATA was undefined. (When undefined, the simulator would not allocate memory for caches, reducing the amount of host memory needed. This could not be used in combination with certain types of memory accesses, for example the partially documented-and-unsupported soft accesses.)

Getting-Lock State

In and before L3.11 the "Getting Lock" states in the state graph displayed by stats might be shorter than the actual getting lock period. This would happen if more than one thread on a processor were simultaneously getting a lock (not necessarily the same one).

VM Page Table Statistics

In L3.11 the page table statistics function, print_memory_statistics_mod, would report errors when the page tables were in fact okay; it might also go into an infinite loop. Both problems were rare. The function is used to display VM information in snapshot and is automatically called after the VM system is initialized.

Reading the Current Time

In L3.11 and earlier versions an assembler error would result when building code that assigned the time to a floating point variable. For example:

double time; time = CURRENTTIME;

Cache Hit Ratio

In L3.11 the cache hit ratio reported was slightly off when more than one memory access could be simultaneously active per thread. In L3.12 the cache hit ratio is accurate.

New Graph

A miss latency graph has been added. The miss latency is cache-centric: it starts when the cache controller is presented a request which cannot be satisfied (either due to no data or wrong state) and ends when the data arrives. (A request is presented to the cache when it reaches the head of the completion buffer and also, if immediate fetches are enabled, when the instruction is issued.) Miss latency does not include virtual-to-physical translation time.

Random Number Generator Seeding

An RTI variable random_seed has been added for seeding the random number generator. When Proteus starts it determines a seed for the random number generator by examining the command line option "-seed" and the RTI variable random_seed. If the -seed option is non-zero, that is used for the seed; otherwise the initial value of RTI random_seed, if non-zero, is used. If both are zero, it sets random_seed to the time (seconds since 0:00, 1 January 1970 UTC). The cycle-counted random number generators are not seeded, but random_seed is visible to user code. The seed used is printed at the beginning of each run with the configuration data and appears as /the title/ of a metric in the metrics graph displayed by stats. (Displaying the random number as a title rather than a value is a kludge used because values are not entered into the event file until the end of a run, which isn't helpful when debugging programs that don't make it that far.)
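The seed selection order can be sketched as follows. The function name choose_seed and its parameters are hypothetical; the sketch only mirrors the precedence described above (command line option, then RTI variable, then the clock) and is not the Proteus code.

```c
#include <time.h>

/* Hypothetical helper mirroring the described precedence:
   1. a non-zero -seed option wins,
   2. otherwise a non-zero initial random_seed RTI value is used,
   3. otherwise the current time in seconds since the epoch is used. */
unsigned long choose_seed(unsigned long seed_option, unsigned long rti_random_seed)
{
    if (seed_option != 0)
        return seed_option;
    if (rti_random_seed != 0)
        return rti_random_seed;
    return (unsigned long)time(NULL);
}
```

Because the chosen seed is printed at the start of each run, a failing run can be repeated by passing the printed value back with -seed.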

4 August 1997, 10:04:52 CDT, Proteus version L3.11

Proteus Version L3.11 has several major enhancements including virtual memory simulation, nonblocking writes (with and without read ordering), and adjustable issue rates. There are also many minor enhancements, including more source information from snapshots and graph data within event files.

Virtual Memory

Proteus version L3.11 simulates virtual-memory systems. Simulated systems use a fully associative translation lookaside buffer (TLB) and a single-level page table. Caches are physically mapped; selection of physical pages can use bin-hopping or color-matching schemes. The selection scheme and the page and TLB sizes are user-settable. Currently, the virtual address space is the same size as the physical address space, so the advantage is greater realism in simulating TLB misses, but not paging. (Paging to and from disk can easily be added, and probably will be in a future release.)

TLB miss handlers are implemented in simulated code, so TLB miss delays include the time to access the page table, including cache and additional TLB misses.

Virtual pages can be moved between processors, even while they are being accessed. Using a directory poisoning scheme, an attempted access to a page being moved will trigger an interrupt handler which ultimately will update the TLB with the new location (when available). Once the TLB is updated, the thread resumes and the access completes.

Page selection can use either color matching or bin hopping.

Functions for accessing shared memory without disturbing the simulated system are now provided. Memory locations can be accessed if the data is in a memory module, dirty in a cache, or even in the network.

For details, see the VM section under the Memory chapter of the add-on documentation.

Statistics

A graph showing the TLB hit ratio has been added to stats, and a state, "Page Relocation" has been added to stats' state graph. The page relocation state is active when a page is being copied between processors.

VM statistics are also printed at the end of a run, including the TLB hit ratio, the number of pages copied, and the number of times a freed page was reused.

Dynamic Memory Allocation

Dynamic memory allocation (when VM is used) uses a modified version of the malloc and free routines that are part of the GNU C library, renamed sm_malloc and sm_free. They are called in the same way as ordinary dynamic memory allocation routines. The same OS_getFOO routines can be used, but when VM is used these will ultimately call vm_set_home and sm_malloc for memory allocation.

On non-VM systems, the OS_getMem routines now run on the simulated system. (Before, they allocated instantly and without communication.)

Simulator Implementation Notes

OS code, including page table initialization, page allocation, and the TLB miss handler, is in file os_vm.ca, and is processed by catoc and augment in the normal manner.

The page table initialization code is awkward, due mostly to the lack of a complete simulation of physical addresses. (It would be simple enough to add.)

Nonblocking Writes

Proteus version L3.11 includes non-blocking writes. When appropriate, store (write) instructions complete in the time needed for an integer instruction, even if they miss. The memory system will complete the writes as the following instructions execute. The execution of a load (read) following a store depends upon the memory model and whether the needed data is being written. If the needed data is being written (by the same processor), the load suffers only an instruction delay. If the data is not being written and ordered reads have been selected, the load must wait for all previous stores to finish before it can complete. Alternatively, if ordered reads have not been selected, loads can start immediately.

When ordered reads is selected, the memory model is sequentially consistent.

A function, Shared_Memory_Wait, is provided that blocks until all pending accesses complete.

Adjustable Instruction Issue Rates

In version L3.11 the issue rate of the simulated CPU can be adjusted, albeit in crude fashion.

Since Proteus does not perform detailed microarchitecture simulation, one cannot simply specify an issue rate for the simulated system, such as 4 for a 4-way superscalar system. Instead, an effective issue rate is used, which should include the effect of branch stalls and resource conflicts, but which should not include cache stalls, which are of course simulated. The effective issue rate is specified by variable EFF_ISSUE_RATE in cpu.param.

Before version L3.10, most instructions had a weight of 1 (even floating point divide), and the number of cycles to execute a block was simply its weight. Other simulation parameters would have to be scaled to improve realism. In version L3.11 instruction weights are based on the UltraSparc-II processor, with a weight of 12 for single-precision fp divide and square root, and 22 for double precision; all others are one.

New Semaphore Functions

Two new semaphore operations have been added:

int sem_IsLocked(Sem addr)
    Returns non-zero if semaphore ADDR is locked.

void sem_Wait(Sem addr)
    Will block until semaphore ADDR is unlocked, but will not attempt to get the lock. It's possible that the lock can be acquired again before the function returns.

Stats Changes

Event Scaling

Stats can now scale the events that it displays. Though events found in the event file are integers, they sometimes describe quantities usually expressed as floating-point numbers. For example, access latency is the total latency for the last 100 accesses, not the average. Stats can now scale events displayed in graph keys, for example dividing the access latency "per 100 requests" found in the event file by 100. A scale is specified by using a "scale" entry in the graph specification and a number format is specified by a "format" entry.

scale <- scalevalue
    Specifies a floating-point number by which to multiply numbers used in the bin ranges of ArrayGraph legends. The default is 1.

format <- formatspec
    Specifies the format used to display bin ranges in ArrayGraphs. With an empty string, the default, an integer format with commas between groups of three digits is used. Anything else is used as a format in a call to sprintf, with the value a double. (For non-C programmers, "%.2f" indicates a floating-point number with two digits past the decimal point, "%e" specifies exponential notation, and "%g" specifies a general format.)

In the graph specification below, a 0.01 scale is used for access latency which, as described above, uses events which are the sum of the last 100 accesses. With scale, an average is displayed; the format specifies one digit past the decimal point.

    ArrayGraph accesslatency (x, 0, NO_OF_PROCESSORS -1)
    {
      menu <- "Access Latency",
      name <- "Memory Access Latency",
      scale <- 0.01,
      format <- "%.1f",
      y_axis <- "Processor",
      x_axis <- "Time / Cycles",
      action {
        EV_MEMACC_LATENCY: VALUE(x)
      }
    }
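How a raw event value might be turned into legend text under the scale and format entries can be sketched as below. The function format_bin_value is hypothetical and is not taken from stats; rounding the scaled value to the nearest integer in the default (empty format) case is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: multiply the raw integer event value by the scale,
   then print it with the user's sprintf-style format, or, for an empty
   format, as an integer with commas between groups of three digits. */
void format_bin_value(char *out, size_t outlen, long raw, double scale,
                      const char *format)
{
    double scaled = (double)raw * scale;

    if (outlen == 0)
        return;
    if (format[0] != '\0') {
        snprintf(out, outlen, format, scaled);   /* value passed as a double */
        return;
    }

    /* Default format. Rounding to nearest is an assumption. */
    long long n = (long long)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
    char digits[32];
    snprintf(digits, sizeof digits, "%lld", n < 0 ? -n : n);

    size_t len = strlen(digits), pos = 0;
    if (n < 0 && pos + 1 < outlen)
        out[pos++] = '-';
    for (size_t i = 0; i < len && pos + 1 < outlen; i++) {
        if (i > 0 && (len - i) % 3 == 0 && pos + 2 < outlen)
            out[pos++] = ',';                    /* separator every 3 digits */
        out[pos++] = digits[i];
    }
    out[pos] = '\0';
}
```

With the access latency settings above (scale 0.01, format "%.1f"), a raw event of 1234 cycles per 100 requests would be shown as an average of 12.3 cycles per request.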

Additional Information

Stats now displays the average of visible events in ArrayGraphs (applying the scale described above).

Graph Specification Within Event File

Stats will now look for graph specification data in the event file, in addition to the usual graph file. If a graph is defined both in the graph file and the event file, graph file settings take precedence. When stats is run, the graph information found in the event file is written to file GraphfileTemp.

Graph specification data is placed in the event file using the function GraphSpec:

event.h
void GraphSpec(char *spec)
    Writes the null-terminated string SPEC to the event file, to be processed by stats.

This feature is in an early state of implementation and subject to change. (That's why there's no way yet to automatically put state event names in the state graph legend.)

Statistics Collection

Utilization Statistics

Utilization statistics collection can now be automatically started at three times: when the simulation starts, just before user code starts, or when the user calls UtilSetOn. These are controlled by a new run-time initialized variable, utilSwitch.

Run Time Initialized Variable
utilSwitch
    Utilization statistics switch. When 0, utilization statistics collection must be activated by the user (by calling UtilSetOn). When 1, collection starts just before usermain is called. When 2, collection starts at the beginning of the simulation.

Comment Strings

Three run-time initialized variables are available for printing comment strings at the beginning of a run.

Run Time Initialized Variable
char* sim_comment_1
char* sim_comment_2
char* sim_comment_3
    Pointers to comment strings. If set, the strings are printed at the beginning of a run.

Average of Histometric

Averages of histogram metrics can be obtained using the function histoMetricAverage, which should be called only at the end of a run.

double histoMetricAverage(int metric);
    Returns the average (statistical mean) of all samples in all sets of METRIC. Should be called after all data is collected.

Source Information

A function returning the host address of its call, and a function to convert a host address to source information have been added.

void *HERE();
    Returns the address of the caller. Code inlining compiler optimizations will confuse this routine.

char *AddrToSource(void *addr);
    Returns the source file and line number corresponding to host address ADDR. The source should be compiled using a debugging option.

Example:

#include "event.h"

printf("This message generated at %s, called from %s.\n", AddrToSource(HERE()), AddrToSource(WHENCE));

State Event Duration Threshold

A duration threshold can be set for state events. If a state is active for strictly less than its threshold, it is not written to the event file. Thresholds can be set for individual states and there is a default threshold.

int stateSetResolution(S_EV_Type event, int threshold)
    Sets the threshold for EVENT to THRESHOLD. Can be called any time after stateRegister, and can be called multiple times. (For example, to record high-resolution event data for some brief period of time.)

Run Time Initialized Variable
sev_default_resolution
    The default duration threshold for state events.
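The filtering rule can be sketched as a single comparison. The function name and the use of long for cycle times are assumptions made for illustration; only the "strictly less than" rule comes from the description above.

```c
/* Hypothetical sketch of the threshold rule: a state interval is written to
   the event file unless it was active for strictly less than the threshold. */
int state_event_recorded(long start_cycle, long end_cycle, long threshold)
{
    long duration = end_cycle - start_cycle;
    return duration >= threshold;   /* 1 = written, 0 = dropped */
}
```

Under this rule a threshold of 0 records every interval, and a state active for exactly the threshold duration is still recorded.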

Snapshot Improvements and Fixes

The dump memory option in snapshot's shared memory menu now works.

The snapshot overwrite check option now works for both shared and host memory.

Snapshot's thread information now includes a stack trace.

Snapshot can read time values that include commas. For example, both 1,000,000 and 1000000 are acceptable inputs.
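Accepting commas amounts to skipping them while accumulating digits. A minimal sketch, assuming non-negative decimal input; the function name is hypothetical, the position of the commas is not checked, and this is not the snapshot source.

```c
#include <ctype.h>

/* Hypothetical parser: reads a non-negative decimal time value that may
   contain commas as digit-group separators. Returns -1 on invalid input. */
long long parse_time_value(const char *text)
{
    long long value = 0;
    int digits = 0;

    for (const char *p = text; *p != '\0'; p++) {
        if (*p == ',')
            continue;                       /* separators carry no value */
        if (!isdigit((unsigned char)*p))
            return -1;
        value = value * 10 + (*p - '0');
        digits++;
    }
    return digits > 0 ? value : -1;
}
```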

Snapshot now includes a hidden option: X<stid> will perform a context switch to thread STID. The option only works if snapshot was called from engine mode and its only purpose is to context switch for a debugger.

18 April 1997, 19:01:12 CDT, Proteus version L3.10, augment

Installation Bug Fix

Improved installation of libraries. Now, all Proteus cycle-counted functions are used; previously only those library functions with counterparts in the system libraries were included. This change will avoid hard-to-find runtime errors that occur when such functions are called in simulator code ("ENGINE" mode) (in contrast to code running on the simulated systems ["USER" mode], where these functions would not cause problems).

Improved Error Detection and Messages

Cycle-counted functions (library and user) now test if they are started in the proper mode. If not, a fatal error is issued.

A call stack is now printed after fatal errors.

18 April 1997, 18:58:07 CDT, stats version L3.43

Font Bug Fix

Because of a bug, earlier versions of stats might end up using narrow Helvetica fonts. These are difficult to read and worse, would not work in the Postscript output. The bug is now fixed.

26 February 1997, 17:27:03 CST, stats version L3.42

Incomplete Event Files

Stats can now read incomplete event files; this is useful for simulations that end abnormally or are in progress. (Gee, I wonder if it's in an infinite loop?)

Views

The current and total number of views is now displayed in the upper left-hand corner of the graph.

The "x" values of views are preserved when switching to compatible graphs. For example, if you zoom in on cycles 1000000 to 1200000 in the states graph and then switch to the access latency graph, that same range will be visible.

Zoom-to-Processor

Double clicking on a row (typically displaying data from a particular processor) in an event graph will zoom the graph to display only that row.

Keyboard Navigation

The keyboard can now be used to pan, zoom, and switch between views. The key bindings are:

[left]      Pan (user's view) left 20% of current view.
[S-left]    Pan left 4% of current view.
[right]     Pan right 20% of current view.
[S-right]   Pan right 4% of current view.
[up]        Pan up 20% of current view.
[down]      Pan down 20% of current view.
[pageup]    Pan to maximum y value.
[pagedown]  Pan to minimum y value.
[home]      Pan to minimum x value.
[end]       Pan to maximum x value.
[C-up]      Zoom out.
[C-down]    Zoom in.
[M-left]    Previous view.
[M-right]   Next view.
[M-up]      First view (entire graph).

where [S-up] indicates shift-up-arrow, [C-up] indicates control-up-arrow, [M-left] indicates meta-left-arrow, etc. (The meta key may be labeled with a black diamond, ALT, or something else.)

Sometimes these pan and zoom commands change the current view, sometimes they increment the current view and apply there. (The View menu or the [M-left] and [M-right] keys can be used to switch between views.) The rules are:

General rule: if it's not reversible it creates a new view.

The first view always displays the entire graph so any pan or zoom in the first view will modify or initialize the second view.

Keyboard zooms affect the current view. Mouse zooms (double clicking or dragging) always create a new view.

Arrow-key pans affect the current view, other keyboard pans create new views.
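The arrow-key pans shift the visible x range by a percentage of its current width. A minimal sketch of that arithmetic follows; the View type and pan_x function are hypothetical, and any clamping to the ends of the graph is omitted because it is not described here.

```c
/* Hypothetical view type: the visible x range, in cycles. */
typedef struct {
    long x_min;
    long x_max;
} View;

/* Shift the view by a percentage of its current width.
   Negative percentages pan left, positive ones pan right. */
View pan_x(View v, int percent)
{
    long shift = (v.x_max - v.x_min) * percent / 100;
    View result = { v.x_min + shift, v.x_max + shift };
    return result;
}
```

For a view covering cycles 1000000 to 1200000, [left] corresponds to pan_x(view, -20) and moves the range to 960000 through 1160000, while [S-right] corresponds to pan_x(view, 4).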

Bin Occupancy Data

The key for event graphs can now indicate the number or percentage of events in the current view within each range. For example, the states graph now indicates the amount of time idle, waiting in a barrier, etc. This information can be switched off from the options menu.

PostScript Output

PostScript output now closely matches the displayed graph. New standard sizes are available (identified by size rather than their intended use) as well as PostScript matching the current window size. The PostScript generation code does not prevent text from overlapping. If it does, either shorten the text by editing the PostScript file or select a larger plot.

The title, subtitle, and the plot background color can easily be changed by editing the PostScript; comments have been included to guide those making more extensive changes.

(For those familiar with PostScript, the title and subtitle can even be changed by defining keys before the graph file itself is read within some other environment. One could, say, specify the title in a TeX file processed by dvips using \special{!/stats-title (Experiment 1 Results) def}, though this would set the title of all included stats graphs.)

A useful feature has been removed. It is no longer possible to generate PostScript using command-line arguments.

28 December 1996, 10:35:20, Version L3.9

Solaris

LSU Proteus now runs on Solaris (a.k.a. Solaris 2 and SunOS 5.5) and SunOS 4.1.X (a.k.a. Solaris 1). A single distribution is used for both.

User Code Parsing

Catoc can handle more ANSI C and GNU C syntax, enough so that the include files normally used by gcc can now be used by catoc.

The user code parsed by Proteus can now include type qualifiers (i.e., volatile and const) within declarators, for example:

int * const * volatile * ptr;

Types consisting of two type names (such as long long) are now parsed but their use for shared variables is only partially supported, as follows. Two-type-name types cannot be used for catoc-dereferenced shared pointers; long doubles cannot be used for catoc- or augment-dereferenced shared pointers.

For example,

    long double *ldp;
    long long *llp;
    ...
    ldp = some_nonshared_ptr;  /* Assign a normal (non-shared) address to ldp. */
    *ldp = 1.0;                /* Always works. */
    ldp = some_shared_ptr;     /* Assign a shared address to ldp. */
    *ldp = 1.0;                /* Never works. */
    @ldp = 1.0;                /* Never works. */

    llp = some_nonshared_ptr;  /* Assign a normal (non-shared) address to llp. */
    *llp = 1;                  /* Always works. */
    llp = some_shared_ptr;     /* Assign a shared address to llp. */
    *llp = 1.0;                /* Always works. */
    @llp = 1.0;                /* Works if the -noat option used with augment (the default). */

GNU C-style attributes are now parsed but their meanings are ignored by catoc. For example,

    char my_array[1024] __attribute__ ((aligned (4)));

is now accepted by catoc which will pass the "__attribute__ ((aligned (4)))" to the C compiler. In the definition

    typedef struct {
      short s;
      char c[1024] __attribute__ ((aligned (8)));
    } SA;

    SA *sa;

the "__attribute__ ((aligned (8)))" is passed to the C compiler as before, but since catoc does not recognize the aligned attribute (which specifies that &sa->c[0] must be aligned on an 8-byte boundary), catoc-dereferenced shared pointers to the "c" member of this type may not work properly.

Cycle-Counted Library

Two string functions, strchr and strpbrk, have been added to the cycle-counted library.

The scalb math library function has been renamed scalbn, which is consistent with the SunOS 4.1.X and Solaris 2 libraries.

For those library procedures which are not cycle-counted, augment can now use an estimated cycle count.

Augment, the program which processes assembly language compiler output generated from user code, normally replaces each library function call with a call to a cycle-counted version of the function. If a cycle-counted version is not available the call is untouched and a warning is issued, for example: "Warning, printf is not cycle counted." In the system thus simulated the function will complete in zero cycles. For greater realism, cycle-counted versions should be available for all library functions used in the program portion being studied. Before L3.9 library functions without cycle-counted versions would have to be removed from user code or cycle-counted versions would have to be written. In L3.9 there is a third option: augment can use an estimated number of cycles for such functions.

Two files are used to specify the estimated cycle count for functions, PROT_HOME/lib/Cycles and USER_DIR/Cycles, where PROT_HOME is Proteus' root and USER_DIR is where the user's files are located. The format of both files is identical: each line of the file contains either a comment or an estimated cycle count. A line specifies an estimated cycle count if its first non-whitespace character is not a "#", otherwise it is considered a comment. An estimated-cycle-count line consists of a procedure name, whitespace, a number, and an optional tail comment. The tail comment must start with a "#". The number is the number of cycles (or more precisely, instructions) which the simulator should use each time the function is called. See PROT_HOME/lib/Cycles for an example.
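The line format can be illustrated with a small parser. This is a sketch written from the description above, not augment's own reader; the function name, the return codes, and the treatment of blank lines as ignorable are assumptions.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parser for one line of a Cycles file.
   Returns 1 and fills name/cycles for an estimated-cycle-count line,
   0 for a comment (or blank) line, and -1 for a malformed line. */
int parse_cycles_line(const char *line, char *name, size_t name_len,
                      long *cycles)
{
    char word[128];
    long count;
    int used = 0;

    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || *line == '#')
        return 0;                        /* comment; blank is an assumption */

    /* procedure name, whitespace, number */
    if (sscanf(line, "%127s %ld%n", word, &count, &used) != 2)
        return -1;

    /* anything after the number must be a tail comment starting with '#' */
    line += used;
    while (isspace((unsigned char)*line))
        line++;
    if (*line != '\0' && *line != '#')
        return -1;

    if (strlen(word) >= name_len)
        return -1;
    strcpy(name, word);
    *cycles = count;
    return 1;
}
```

A line such as "my_func 40 # rough guess" (my_func being a made-up name) would yield the name my_func with an estimate of 40 cycles per call.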

The file PROT_HOME/lib/Cycles currently contains cycle counts for "simulated instructions", functions corresponding to instructions that are not included in all SPARC implementations.

12 October 1996, 16:14:09, Stats version L3.41

Overlapping Samples in Color Graphs

An average is now used to determine a color for overlapping samples in color graphs. Samples are overlapping if they would be mapped to the same pixel; before L3.41 stats would use the value of the latest sample, now an average of the overlapping samples is used in graphs without named maps.
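Averaging overlapping samples can be sketched as a per-pixel sum and count. The function and array names are hypothetical, and the mapping from sample time to pixel index is assumed to have been done already.

```c
#include <stddef.h>

/* Hypothetical sketch: pixel[i] is the pixel column that sample i maps to.
   On return, avg[p] holds the mean of all samples that landed on pixel p
   and count[p] holds how many there were (avg[p] is 0 when count[p] is 0). */
void average_overlapping(const size_t *pixel, const double *value,
                         size_t nsamples, double *avg, size_t *count,
                         size_t npixels)
{
    for (size_t p = 0; p < npixels; p++) {
        avg[p] = 0.0;
        count[p] = 0;
    }
    for (size_t i = 0; i < nsamples; i++) {
        if (pixel[i] >= npixels)
            continue;                    /* sample falls outside the view */
        avg[pixel[i]] += value[i];       /* accumulate the sum first */
        count[pixel[i]]++;
    }
    for (size_t p = 0; p < npixels; p++)
        if (count[p] > 0)
            avg[p] /= (double)count[p];
}
```

The earlier behavior (using the latest sample) would correspond to overwriting avg[pixel[i]] with value[i] instead of accumulating.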

Default (color) map renamed to "Linear".

Minor changes and bug fixes.

10 October 1996, 14:06:47, Stats version L3.40

"Y" Value Scales on Color Graphs

Two new Y value scales have been added to color graphs. (The Y value scales determine how event values are mapped to colors in color graphs. "States" and "Net Contention" are color graphs; in States the value identifies the states and in Net Contention the value identifies the total delay encountered by the last 100 messages passing through a network node.) In earlier versions of stats values could be mapped to up to 16 colors in two ways, linearly and based on a hundred-bucket histogram (AutoColor). With linear mapping colors are linearly divided between the range of values; with AutoColor a hundred-bin histogram is constructed with bins spread over the range of values. Colors are assigned to bins so that each color covers approximately the same number of values. When at least one value is much larger than most others, both linear mapping and histogram mapping provide low resolution for the smaller values.

Two new mappings use percentile rankings of data points. In "Value Perhexile" mapping, a thousand-element subset of values is chosen at random and sorted, and duplicate elements are removed. Then colors are assigned by element positions. (That is, if there are 160 unique elements, the first color is assigned to all values less than the 10th, the second to remaining values less than the 20th, etc.) In the "Perdecile" mapping a thousand samples are selected and sorted and then colors are assigned as above; duplicate elements are not removed. Available mappings are now selected using the "Map" menu.
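The "Value Perhexile" steps can be sketched as sort, de-duplicate, pick dividers by position, then look up a color. This is an illustration under stated assumptions, not the stats source: the function names are hypothetical, the random choice of the thousand-element subset is left to the caller, and positions are taken as zero-based indices.

```c
#include <stdlib.h>

#define NUM_COLORS 16   /* sixteen color classes, as in the mapping above */

static int compare_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Sort the sample in place, drop duplicates, and return the unique count. */
size_t sort_unique(long *sample, size_t n)
{
    size_t kept = 0;
    qsort(sample, n, sizeof sample[0], compare_long);
    for (size_t i = 0; i < n; i++)
        if (kept == 0 || sample[i] != sample[kept - 1])
            sample[kept++] = sample[i];
    return kept;
}

/* Divider k-1 is the element at zero-based position k*n/16 (an assumption
   about indexing). Requires n >= 1. */
void perhexile_dividers(const long *uniq, size_t n,
                        long dividers[NUM_COLORS - 1])
{
    for (size_t k = 1; k < NUM_COLORS; k++)
        dividers[k - 1] = uniq[k * n / NUM_COLORS];
}

/* A value gets the first color whose divider it is strictly less than. */
int color_for_value(long value, const long dividers[NUM_COLORS - 1])
{
    int color = 0;
    while (color < NUM_COLORS - 1 && value >= dividers[color])
        color++;
    return color;
}
```

Skipping the de-duplication step in sort_unique and running the rest unchanged gives the "Perdecile" variant described above.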

Density Information in Color Graphs

Color graphs can now display density information, that is, the relative number of samples per time. Density is indicated by the height of the bars in color graphs, with the minimum height 1/5 the maximum height. For example, if a bar is at its maximum height then that segment has no fewer values than any other bar in the row. Lower height indicates fewer samples. Currently, a row is divided into 30 segments. The relative number of samples in a row is indicated by a blue bar to the right of the graph area. By default, density is off for color graphs with named maps (such as States) and on for other color graphs. The "Event Density" option under the "Options" menu can switch density information for the two graph types on or off individually.

View Behavior

User-selected views (zooms) are now maintained until the graph is changed or another view is selected. The "Forward" and "Back" View-menu options have been changed to "Next" and "Previous". (These refer to selected zooms.)

Default Colors

A new set of default colors is now used in color graphs without user-defined maps. The new colors make it easier to determine color ordering (at least on my workstation :-) ), although perhaps more difficult to distinguish a color from its neighbor.

User-Defined Colors

Users can define their own colors (in terms of RGB values) for use in user-defined color maps. See the Graphfile included with the engine code for an example.

Trace File

When command line option "-f -" is given the trace file will be read from standard input. Using this option, trace files can be kept in compressed form and decompressed into stats:

gzcat events.sim.gz | stats -f -

"Hardcopy"

Density information is not included in hardcopy formats (such as PostScript).

4 October 1996, 14:16:40, Version 3.8.1

Minor bug fixes only.

30 September 1996, 19:26:15, Version 3.8

Shared Memory Dereferencing

The operators used in Proteus C code to dereference shared pointers are now the same as the regular dereference operators, *, ->, and foo[bar]. That is, it is no longer necessary to change * to @, -> to @>, and (best of all) foo[bar] to @&foo[bar].

The shared memory function calls are now inserted by augment; before Version 3.8 they were inserted by catoc. Apart from avoiding the tedium of changing dereference operators, these changes produce much more realistic code since the compiler "sees" dereferences, not function calls. The compiler would save volatile registers around function calls and augment would count these saves as being part of the program, inflating the cycle count. Also, the compiler can perform optimizations that would not be possible if it saw function calls.

To avoid the hassle of changing @'s back to *'s in already-ported code, catoc has an option "-noat" which instructs catoc to translate shared memory dereferences back to regular dereferences.

Atomic shared memory operations are not inserted by augment. That is, Increment(shared_ptr) and similar operations are still implemented using a function call.

N.B. When implemented using function calls, all shared memory accesses were treated as volatile. With the new scheme, the volatile type qualifier is sometimes necessary.
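
A loop that polls a shared location is one place where the qualifier is likely to matter. The following is a minimal sketch in ordinary C, assuming flag points into shared memory and that some other thread eventually stores a nonzero value there; wait_for_flag is a hypothetical name.

```c
/* Minimal sketch, assuming `flag` points into shared memory and another
   thread eventually stores a nonzero value there.  Returns the number of
   times the loop body ran before the flag was seen set. */
int wait_for_flag(volatile int *flag)
{
    int spins = 0;

    /* Because *flag is volatile, the compiler must re-read it on every
       iteration instead of keeping the first value in a register. */
    while (*flag == 0)
        spins++;
    return spins;
}
```

Without the qualifier, an optimizing compiler is free to read the location once and loop on the cached value.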

Cycle Counter Reading

The cycle counter (CURRENTTIME) can now be read anywhere. Before 3.8, augment would sometimes reject cycle counter reading in optimized code (when the read appeared in a branch delay slot).

Unsigned Shared Reads

Unsigned shared memory operations are now supported. Use "UNSIGNED" in the mode parameter of shared memory reads.

Network Interface Contention Statistics

A new event, network interface (NWI) contention, is recorded. NWI contention at a processor is the total amount of time the last 100 messages waited at the switching node before being dispatched (delivered) to the processor (as an interprocessor interrupt or memory operation). A graph has been added to Graphfile to display this data.

Optimization

Four optimization levels are now supported: gcc -O1, -O2, -O3, and no optimization; these are selected using makesim arguments -DOPT, -DOPT2, -DOPT3, and -DOPT0, respectively. (No optimization is the default.)

Annotated Augment Output

Augment provides additional annotation when given the -a option. This is useful for determining how many simulated cycles a segment of code will take. It's also handy for debugging Proteus.

Memory Contention

Shared memory access in the face of contention has been improved. From version 3.3 to 3.8, a backoff delay was added between accesses after a certain number of unsuccessful tries. This virtually guaranteed that once backoffs were used, data brought into the cache would be invalidated before read. Now memory access is attempted three consecutive times before each use of the backoff. (This change only affects code that generates "probable livelock" warnings.)
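
The "three tries, then back off" pattern can be sketched as a small predicate. This is not the Proteus code: backoff_due and TRIES_PER_BACKOFF are hypothetical names, and the sketch ignores the initial number of unsuccessful tries that must occur before backoffs are used at all.

```c
#define TRIES_PER_BACKOFF 3   /* from the text: three attempts per backoff */

/* Hypothetical sketch: returns nonzero if a backoff delay should precede
   the next attempt, given how many attempts have failed so far. */
int backoff_due(int failed_tries)
{
    return failed_tries > 0 && failed_tries % TRIES_PER_BACKOFF == 0;
}
```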

testProt

Several minor improvements and bug fixes have been made to testProt. Interrupting and timing out work properly again. Changing makesim options (MakesimOptions) or the UserMake file used between tests will cause user object code to be deleted. So, for example, user code will be re-compiled when changing from one UserMake file to another even when no other changes are made. Event files can now be compressed; use the option -ez.

18 July 1996, 18:57:59, Version 3.7 beta 4

Optimization

User code can now be optimized. Optimization is turned on using the new -DOPT makesim option. (The augment program used in previous versions did not correctly handle optimized code.)

The libraries are now optimized.

C-Code Parsing (catoc)

Several catoc bugs, which resulted in incorrect execution given correct C code, have been fixed. In particular, an expression evaluating to a shared address dereference in the lvalue of an abbreviated assignment will now execute just once. For example, in "@a++ /= 3", a is incremented once. Switch statements no longer need to be surrounded by CYCLE_COUNTING_OFF and CYCLE_COUNTING_ON; if the new optimization option is used, they cannot be surrounded by the macros. Structure offsets using @> are now correctly computed.

There is no compatibility mode for those programs that depend upon the incorrect behavior.

Shared-Memory Allocation

V2 memory allocation routines, which allocate a contiguous block of address space over several memory modules, can now distribute the block of address space in three different ways: contiguous (the method used before 3.7), distributed, and random. As before, bytes_needed bytes are allocated on numProcs modules, with >= bytes_needed/numProcs bytes per module. (For simplicity, the elements in prefProc are assumed distinct.) With contiguous distribution, bytes_needed/numProcs consecutive addresses (in address order) are allocated on one memory module, after which allocation proceeds on the next module (as specified by prefProc). With distributed distribution, consecutive pages are allocated on consecutive memory modules (as specified by prefProc). With random distribution, pages are placed on randomly chosen modules (using a uniform distribution over those specified in prefProc). By using random distribution, very poor performance can be blamed on bad luck rather than an ill-chosen data arrangement. Until documentation is available, see shVmem.c and shVmem.h.
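
The three distributions can be compared with a small mapping function. This is a rough sketch, not the code in shVmem.c: it works in whole pages for all three methods, module_index and the enum names are hypothetical, and rand() is only approximately uniform. The returned value is meant as an index into prefProc.

```c
#include <stdlib.h>

enum dist { DIST_CONTIGUOUS, DIST_DISTRIBUTED, DIST_RANDOM };

/* Hypothetical sketch: return the index (into prefProc) of the module that
   holds page `page` of a block of `num_pages` pages spread over `num_procs`
   modules.  Assumes num_pages >= 1 and num_procs >= 1. */
int module_index(enum dist d, int page, int num_pages, int num_procs)
{
    /* Pages per module, rounded up so that every page has a home. */
    int pages_per_module = (num_pages + num_procs - 1) / num_procs;

    switch (d) {
    case DIST_CONTIGUOUS:
        return page / pages_per_module;   /* fill one module, then the next */
    case DIST_DISTRIBUTED:
        return page % num_procs;          /* consecutive pages, consecutive modules */
    default: /* DIST_RANDOM */
        return rand() % num_procs;        /* roughly uniform choice */
    }
}
```

For an 8-page block on 4 modules, contiguous places pages 0 and 1 on the first module, while distributed places pages 0 and 4 there.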

Processor-State Event Generation and Display

Priorities, non-negative doubles, can be assigned to states (which are written to the events file). When several states are active at once, the one with the highest priority is visible. Example usage can be seen in the n-queens program, which now uses state events (search for "marker"). The priorities of the semaphore (getting lock) and barrier states have been set (to 100) so that they appear over the default user state priority, 50. (In previous versions priorities were assigned on a first come, first served basis, and so user events would obscure barrier and semaphore events.) Until documentation is updated, see event.h for prototypes.

Two additional switches can now be used with the StateEvent procedure used to indicate state event changes. When switch S_EVS_enter is used in a call to StateEvent (parameter 4) a counter for that state and processor is incremented; when switch S_EVS_exit is used, the counter is decremented. The state is active when the counter is positive; using S_EVS_exit when the counter is zero is trapped as a fatal error. These switches can be used in a recursive routine.
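
The counter behavior described for the two switches can be modeled as follows. This is a sketch of the semantics only, for a single processor; it does not call StateEvent, and state_enter, state_exit, state_active and MAX_STATES are hypothetical names.

```c
#define MAX_STATES 8   /* illustrative size only */

/* One nesting counter per state (a single processor is modeled here). */
int state_depth[MAX_STATES];

/* Models the effect of the enter switch: increment the counter. */
void state_enter(int state)
{
    state_depth[state]++;
}

/* Models the effect of the exit switch: decrement the counter.
   Returns -1 where Proteus would trap a fatal error (counter already
   zero), 0 otherwise. */
int state_exit(int state)
{
    if (state_depth[state] == 0)
        return -1;
    state_depth[state]--;
    return 0;
}

/* The state is active while its counter is positive. */
int state_active(int state)
{
    return state_depth[state] > 0;
}
```

A recursive routine that enters the state on each call and exits on each return keeps the state active until the outermost call returns.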

New Statistics

Two additional events are collected: memory access latency and network output latency. Memory access latency is the time between starting a memory access instruction (e.g., x=@y) and the time that it completes. Latency per 100 accesses is recorded for each processor. Network output latency measures the amount of time protocol messages generated by the memory and cache controllers wait before entering the network. (Protocol messages include requests for data and their responses, data to be written, invalidation requests, etc.) Appropriate graphs for both have been added to stats and event collection for these can be switched off using config or by editing conf.param.

Hit-Ratio Bug Fix

The correct hit ratio is now recorded in the event file. (In and before version L3.6 the hit ratio data recorded in the event file did not have correct processor attribution. That is, the outcome of a cache access by one processor could be counted towards another processor's hit ratio.)

ANL Macros

The ANL macro definitions have been modified so that the semaphores allocated by LOCKINIT and ALOCKINIT are spread over all available processors. Previously they were all allocated on processor zero. See c.m4.proteus.

The names of functions provided for ANL macro expansion have been changed from m4 to p4. C macros have been provided for source-level compatibility (for code which includes proFun.h).

Minor Changes

Tagged memory can now be turned off by undefining FEBITS in mem.param. When turned off, now the default, Proteus uses almost half the amount of memory. Tagged memory is used by the full/empty-bit semaphores, and can appear in user code.

Unneeded options have been removed from build commands (started from "makesim" command). Build output is now easier to read.

A minor bug in shared memory allocation, which would cause a segmentation fault or bus error, has been fixed.

15 November 1995, 17:38:43, testProt Version 1.0

Variable-initialization entries are now verified. A test script can now be verified without touching files.

New Defaults: Default extension for parameter files changed from paramT; default extension for variable init. files changed from parP. Test mode is now off by default.

Test mode can now be specified in script file by putting TestMode 1 in header.

Command used to build and run proteus now placed in build and run output files (build.out and run.out, by default).

The event file is now saved even if proteus ends prematurely.

15 November 1995, 14:45:06, Proteus Version 3.6

A new memory-access mode, exclusive read, has been added. An exclusive read gets an exclusive (READWRITE) copy of a block (line). This might be used if a value being read will soon be written. A new version of catoc now automatically uses exclusive reads for the lvalue of abbreviated assignments. (Such as a+=b.)

Memory and cache operations are now queued if new parameter CORRECT_QUEUES in cache.param is defined. (Without this parameter defined memory and cache operations would start as soon as they arrive. This results in unrealistic behavior when operation latency is high and one operation arrives shortly after another.) With the parameter set an operation does not start until after the previous one finishes.

The amount of time that memory and cache operations wait is collected and can be viewed using stats. (If CORRECT_QUEUES is not set the waiting time will always be zero.)
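
The effect of queuing on waiting time can be sketched with a single-server model. This is an illustration, not the simulator's code; queue_wait and its parameters are hypothetical, and times are in arbitrary cycle units.

```c
/* Hypothetical sketch of queued behavior.  *busy_until holds the finish
   time of the previous operation and is updated for this one.  Returns
   the time this operation spends waiting before it starts. */
long queue_wait(long arrival, long latency, long *busy_until)
{
    /* An operation starts when it arrives or when the previous one
       finishes, whichever is later. */
    long start = arrival > *busy_until ? arrival : *busy_until;

    *busy_until = start + latency;
    return start - arrival;
}
```

In this model, an operation arriving at time 15 behind one that occupies the unit until time 30 waits 15 cycles. Without queuing it would start at once and the recorded wait would be zero.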

The state graph displayed by stats now indicates when utilization statistics are not being collected. This is indicated by a color on the highest-numbered processor's state display.

Hook variables void (*user_statistics_on_hook_)() and void (*user_statistics_off_hook_)() are provided so that user statistics collection can now be turned on and off at the same time as utilization statistics. The variables should be set by user code; chaining of multiple hook functions is the responsibility of the user.
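
One way user code might chain several functions through a single hook variable is sketched below. The hook variable is declared locally as a stand-in so the example compiles on its own (in a Proteus program it is provided by the simulator); add_on_hook, run_on_hooks, MAX_HOOKS and count_stats_on are hypothetical names.

```c
/* Stand-in declaration so the sketch is self-contained; in a Proteus
   program this variable is provided by the simulator. */
void (*user_statistics_on_hook_)(void);

#define MAX_HOOKS 4   /* illustrative limit */

void (*on_hooks[MAX_HOOKS])(void);
int n_on_hooks;

/* The one function installed in the hook variable; it calls each
   registered function in registration order. */
void run_on_hooks(void)
{
    int i;
    for (i = 0; i < n_on_hooks; i++)
        on_hooks[i]();
}

/* Register f; returns 0 on success, -1 if the table is full. */
int add_on_hook(void (*f)(void))
{
    if (n_on_hooks == MAX_HOOKS)
        return -1;
    on_hooks[n_on_hooks++] = f;
    user_statistics_on_hook_ = run_on_hooks;
    return 0;
}

/* Example user hook: counts how often statistics were switched on. */
int stats_on_count;
void count_stats_on(void)
{
    stats_on_count++;
}
```

The same arrangement would apply to user_statistics_off_hook_.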

The summary information at the end of a simulation run now includes two fault rates: the number of page faults per second and the number of page faults per second requiring I/O. These are a good indicator of performance degradation due to memory access patterns.

The exact network model can now simulate a crossbar. A crossbar is selected by setting Nk to 1 and Ndim to the number of processors.

A new memory allocation function, OS_getMemUniform(unsigned long bytes_needed), has been added. It allocates a block of shared memory of length bytes_needed. If bytes_needed is smaller than V2's page size, it will be allocated on a single processor, otherwise it will be spread over all processors. The processor at which allocation starts is selected in round-robin fashion.

The code implementing shared memory uses host memory space more efficiently: Memory for cache directories is used only for active cache directories.

The -= bug has been fixed. (This was a catoc bug.)

14 August 1995, 19:06:16, Version 3.5.1

Bug fixes only:

Null log file name no longer crashes proteus. File names displayed in diagnostics (snapshot, for example) for included files now correct.

28 July 1995, 19:11:41, Version 3.5

Two new methods are used for choosing a line to evict from the cache. In the "fully random" method a line is chosen at random. In the "part random" method an invalid line, if any, is replaced. If no lines are invalid, a victim is chosen at random. The replacement method can be set using config. See the Bugs file for a description of the previous method. The previous method is still available; see cache.h.
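
The "part random" choice can be sketched as follows. This is not the code in the Proteus cache module: choose_victim is a hypothetical name, the sketch takes the first invalid line it finds (the text does not say which invalid line is used), and rand() is only approximately uniform.

```c
#include <stdlib.h>

/* Hypothetical sketch of "part random" replacement.  valid[i] is nonzero
   if line i of the set holds valid data.  Returns the index of the line
   to replace.  Assumes n_lines >= 1. */
int choose_victim(const int *valid, int n_lines)
{
    int i;

    for (i = 0; i < n_lines; i++)
        if (!valid[i])
            return i;            /* reuse an invalid line if there is one */
    return rand() % n_lines;     /* otherwise pick a victim at random */
}
```

The "fully random" method corresponds to the last line alone.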

The cache events written to the trace file now specify the percentage of cache hits. (Previously they specified something slightly higher than the hit ratio. See the Bugs file for details.)

A slight change was made in the cache coherence protocol used in network configurations. Before L3.5, a read to a line in an exclusive state on another processor would result in the line being invalidated. Starting with L3.5, the line changes from exclusive to shared.

Snapshots and fatal-error messages now provide file and line-number information.

The SimMake file can now include an auxiliary file. The auxiliary file--to be used to extend the simulator--can specify items to add to the source, header, and object lists.

Character strings that describe some elements of the Proteus configuration are automatically created. If a simulation title is not specified then a title is constructed using these strings. Some of these strings are printed when Proteus starts.

Option added to have Proteus ignore (and correctly complete) access to non-shared memory using shared-memory access functions (except the readtag functions). Option settable from config.

12 July 1995, 17:23:43, Version 3.4

Datapath width between network interface and processor is user settable using run-time initialized variable NWItransPerByte. This affects the sending and receiving of messages.

The arrival time of messages sent to the same processor is now based on the length of the message and NWItransPerByte.

A new statistic, network traffic, is collected for networks and the bus. A graph is now included in the stats Graphfile.

The communication pattern graphs have been removed from the Graphfile.

Network contention data is now collected for the bus. Note that the bus is not well modeled so contention and traffic volume statistics, and other aspects of system performance, will not be accurate.

Timing of messages sent by memory and cache controller is now based on the size of the messages.

A RATE_STAT macro was added. It works like SUM_STAT, except the quantity written to the trace file is divided by the amount of time +1 since the last write. Also like SUM_STAT, this macro is only documented in file event.h.
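
The quantity described for RATE_STAT can be written out as a small function. This is a sketch of the arithmetic only, not the macro from event.h; rate_value and its parameter names are hypothetical.

```c
/* Sketch of the value described for RATE_STAT (not the macro itself):
   the accumulated quantity divided by the time since the last write,
   plus 1.  Assuming now >= last_write, the +1 also keeps the divisor
   nonzero when two writes share a timestamp. */
double rate_value(double sum, long now, long last_write)
{
    return sum / (double)(now - last_write + 1);
}
```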

7 July 1995, 10:53:38, Version 3.3

New Statistics Collected.

Message timing data provided to user now separately includes arrival time at node and time that interrupt was issued.

Message timing statistics now collected for protocol messages. This feature can be switched on and off using the run-time initialized variable ProtoMsgStats.

Message statistics now include message delay (waiting time in network), and delay at network interface.

Interconnection Network Changes

The network datapath width can now be set by the user. This is done using the run-time initialized variable FlitsPerByte. Of course, it's not an integer. The bus interconnection does not make full use of this parameter.

The length parameter in the send_ipi and send_ipiV functions now specifies the length of messages in bytes rather than flits. Results are identical to the previous interpretation if FlitsPerByte is set to its default value, 1.

Memory System Changes

The length of cache-coherence protocol packets is now user-settable. The run-time initialized variable ProtocolSize specifies the size of the op-code and address field in bytes. The size of any data sent is added on.

Before version L3.3 the length of all cache-coherence protocol packets was 1 byte (including data) when the "software coherence" option (CACHEWIDTHDATA) was turned off. Starting with L3.3 the protocol-packet size is independent of the "software coherence" option setting and is based on the contents of the packet.
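
The packet-length rule can be written out as simple arithmetic. This is a sketch under stated assumptions: packet_bytes and packet_flits are hypothetical names, and the conversion to flits assumes that a length in bytes is simply multiplied by FlitsPerByte, which the text implies but does not state.

```c
/* Sketch: length in bytes of a protocol packet.  protocol_size stands for
   ProtocolSize (op-code plus address field); data_bytes is the size of
   any data carried, 0 for packets that carry none. */
int packet_bytes(int protocol_size, int data_bytes)
{
    return protocol_size + data_bytes;
}

/* Assumed conversion from bytes to flits; flits_per_byte stands for
   FlitsPerByte and need not be an integer. */
double packet_flits(int bytes, double flits_per_byte)
{
    return bytes * flits_per_byte;
}
```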

A new write mode, timid write, has been added. A timid write is similar to a conventional write, except that it can succeed or fail. If it succeeds, the write is completed as in a normal write. If it fails, the write is not performed. A timid write fails if the memory block was in the busy state when the request arrived.
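
The success and failure cases can be sketched as follows. This models only the semantics described above, not the Proteus memory controller; the types and the name timid_write are hypothetical.

```c
enum block_state { BLOCK_IDLE, BLOCK_BUSY };

struct mem_block {
    enum block_state state;
    int value;
};

/* Hypothetical sketch of timid-write semantics: returns 1 and stores the
   value if the block is not busy when the request arrives; returns 0 and
   leaves the block untouched otherwise. */
int timid_write(struct mem_block *b, int value)
{
    if (b->state == BLOCK_BUSY)
        return 0;
    b->value = value;
    return 1;
}
```

The caller decides what to do after a failure, for example re-reading the location before trying again.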

The test & test & set semaphore now uses a timid write instruction.

The shared memory instructions use far less host CPU time at the expense of switch- and wait-handler functionality. Instead of polling the cache every cycle, the instructions only test the cache after each change. Thus, switch and wait handlers are called much less frequently, or never at all. This change has only a small effect on the performance of the simulated system.

The shared-memory instructions have different behavior when a livelock warning is encountered, eliminating certain types of fatal livelock. First, the livelock warning occurs after a certain number of cache changes, rather than a number of cycles. When a livelock warning occurs the access instruction will delay re-issuing the memory request by a randomly chosen amount of time. The random delay is re-chosen on subsequent tries, with an increasing delay distribution. Livelock warnings have been added to most memory access instructions.
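
A randomized delay with an increasing distribution might look like the sketch below. The text says only that the distribution increases on subsequent tries; the doubling, the cap, and the name retry_delay are assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical sketch: pick a random delay in 1..bound, where bound starts
   at `base` and doubles with each retry up to `cap`.  Assumes
   1 <= base <= cap and cap no larger than RAND_MAX. */
long retry_delay(int attempt, long base, long cap)
{
    long bound = base;
    int i;

    for (i = 0; i < attempt && bound < cap; i++)
        bound *= 2;
    if (bound > cap)
        bound = cap;
    return 1 + rand() % bound;   /* roughly uniform over 1..bound */
}
```

Re-choosing the delay on every try makes it unlikely that two contending processors keep retrying in step with each other.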

The size of the cache directory is now user-settable. The parameter DIRECTORY_SIZEP specifies the directory size; it can be set using the config program.

Barrier

A barrier function has been added. (This is an adaptation of the basic_barrier available at the same site as Proteus.) The barrier is far more efficient than the barrier provided for ANL-macro compatibility, but does not have all of its functionality. The barrier generates events if the new parameter WATCH_BARRIER is set. Graphs have been added to the Graphfile which use these events.

System State Events

The state event, which was not fully used in Proteus 3, is now used to specify the simulated system state. The default graph in the Graphfile makes use of these states. Currently, the system states illustrated are: idle, barrier idle, barrier busy, getting semaphore lock, and busy. Additional states can be added by the user.

Miscellaneous

The simulation title is now a run-time initialized variable, simulation_title_.

Network details are printed at the beginning of a run.

The snapshot menus will no longer go into an infinite loop when an EOF is encountered.

Many undefined configurations generate compile or run-time error messages.

18 June 1995, 16:20:09, Version 3.2

Added histogram metric code. Added variable initialization code. Added message journey statistics (journey stats) to modeled network and bus systems. Changed modeled network from store-and-forward to wormhole routing. Started using alpha test and numbered release directories.

Version numbers not used below.

9 March 1995, 11:53:27

Fixed bug in shared memory allocation routines. A context switch could occur while a semaphore was locked, resulting in deadlock. Fixed code in OSmem.common.h and OSmem_internal.c.

27 February 1995, 13:06:36

Large memory allocation (V2) routines available.

21 February 1995, 13:28:39

Added correct error messages in snapshot.c.

11 February 1995, 17:19:33

Changed _timer_handler() in rt_thread.ca so thread is woken if remaining number of ticks is <= 0.

24 October 1994

Added code so that messages sent by the ipi functions could have timing information appended. Timing information will be appended if JOURNEY_STATS is defined and if the argc parameter in the send_ipi[V] is negative. The timing information appears as three extra integer arguments: the waiting time in input queues, the waiting time in output queues, and the number of hops. See Bernoulli.ca for example code.

25 July 94

Added delay for message arrival. (Note that modeled network uses store & forward FC.)

24 July 94

Moved delayed dispatch code to packet.c. Added messages for random request and natural request ordering to proteus startup code (in req_queue.c).

22 July 94

Fixed non-causal message arrival problem. Added a delayed dispatch packet simulator request. This is used by net.exact.c::route_packet_handler_ in place of dispatch-packet. Also restored message length to receive time computation. Restored non-causal check in ihandler.c::ServiceInterrupt.

21 July 94

Changed next_request_ (after Findmin) so that it would properly dequeue a tail request.

11 July 94:

Removed message length change to route_handler (since interrupt would not occur at proper time anyway.)

3 July 94:

Changed net.exact.c route_handler to account for message length.

David M. Koppelman - koppel@ee.lsu.edu

Modified 14 Jan 1998 8:39 (1439 UTC)