Local changes include virtual memory, nonblocking writes, more efficient simulation of shared memory, automatic detection of shared-memory accesses, several additional access modes and cache configurations, and a memory-allocation system allowing flexible distribution of dynamically allocated memory over memory modules.
Proteus version L3.11 simulates virtual-memory systems. Simulated systems use a fully associative translation lookaside buffer (TLB) and a single-level page table. Caches are physically mapped; physical pages can be selected using bin-hopping or color-matching schemes. The selection scheme, page size, and TLB size are user-settable. Currently, the virtual address space is the same size as the physical address space, so the advantage is greater realism in simulating TLB misses. (Paging to and from disk can easily be added, and probably will be in a future release.)
The TLB miss handler is implemented in software (running on the simulated system) so TLB miss delays include the time to access the page table, including cache and additional TLB misses.
Virtual pages can be moved between processors, even while they are being accessed. An attempted access to a page being moved triggers an interrupt handler, which ultimately updates the TLB with the new location (when available). Once the TLB is updated, the thread resumes and the access completes.
VM-related parameters are placed in the file `mem.param'; in addition, an RTI is used to specify page selection. Virtual memory is turned on by defining parameter `VM_ON'; the page size is set with `VM_PAGE_LOG_SIZE'. The number of TLB entries can be specified absolutely or relatively (relative to the minimum number needed).
Proteus L3.11 uses directory poisoning to change virtual-to-physical mappings (to move pages between processors or to return virtual address space to the OS); as part of this scheme, TLB entries have a limited lifetime. (See James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA highly scalable server," 24th Annual International Symposium on Computer Architecture, pp. 241-251, June 1997.) Parameter `TLB_SOJ_LIMIT' specifies the lifetime.
Lower settings increase the TLB miss rate, which can have a substantial negative impact on performance. Higher values can increase the amount of time it takes to allocate new memory after all free pages have been previously mapped, a concern for programs that migrate a large number of pages or programs that deallocate and reallocate large amounts of memory.
For example, consider the definitions below, placed in `mem.param'.
#define TLB_SIZE_ABSOLUTE
#define TLB_SIZE          200
#define TLB_SOJ_LIMIT     100000
#define VM_PAGE_LOG_SIZE  12
#define VM_ON
These specify a system using virtual memory with 4096-byte pages and 200-entry TLBs. The 100,000-cycle TLB sojourn limit had a negligible impact on all programs tested.
When virtual address space is allocated (using `vm_sbrk', usually called by library routines) page table entries are initialized for each virtual page, but physical pages are not yet chosen. A physical page is chosen for a virtual page when the memory is first accessed (and later if the page is moved).
There are two characteristics of a physical page that affect performance: its location and its color. The location of a page is the memory module at which it is stored. (See section Page Relocation.) A page's color is the value of a range of bits in the addresses that access it; the range is the intersection of the page-number bits (as defined by the virtual memory system) and the set-number bits (as defined by the cache configuration). If the intersection is empty there is one color; if the intersection contains one bit there are two colors; and so on. For example, consider a system with a 4096-byte page size, and a cache with 8-byte lines and 8192 sets. Bits 12 through 31 (inclusive) of an address give the page number, and bits 3 through 14 (inclusive) give the set number; the page color is the value of bits 12 through 14.
Since caches are physically addressed, page color can have an impact on cache performance. If every page selected had the same color, only a fraction of the cache would be used (one over the number of possible page colors). If pages were selected arbitrarily, the tedious and time-consuming data-layout efforts of a conscientious programmer would be undermined.
Two systems are available in L3.11 for choosing the color of physical pages: bin hopping and color matching. In color matching, specified by setting RTI `vm_bin_hop' to 0, the OS will first look for a physical page of the same color as the virtual page. With bin hopping, specified by setting `vm_bin_hop' to 1, the OS uses a processor-local "next-color" counter. The counter is incremented (or set to zero) each time a page is allocated. (See Richard E. Kessler and Mark D. Hill, "Page Placement Algorithms for large real-indexed caches," ACM Transactions on Computer Systems, vol. 10, no. 4, November 1992, pp. 338-359.)
When a request is made to allocate a page (using procedure `vm_get_page' in os_vm.ca, which is ordinarily called by library routines) a page at the desired location and of the desired color is returned, if such a page is available. If not, an alternate page is chosen.
The steps below are followed until a page is found.
In the sample RTI file shown below (`prot.par' by default), a bin-hopping page selection scheme is specified.
vm_bin_hop 1
Pages can be moved between processors at any time, even while being accessed. (If pages are to remain in place, it is most efficient to "move" a page to its location before accessing it, otherwise data would have to be copied.)
Pages are moved using function `vm_set_home'. In addition, there are dynamic memory allocation routines that set the page locations automatically, in a variety of patterns. (See section V2 Memory Allocation.)
The following functions and macros are intended for user code. Simulated code might access the page size so that it can properly block data. Debugging and instrumentation code can use the metaFoo functions to access, but not disturb, simulated memory.
The macros and functions below are in file vm.h, except where noted otherwise.
Function `metaVtoH' converts virtual address addr to a host address, `metaVtoHT' converts virtual address addr to a host address of the tag (if any), and `metaVtoP' converts the address to a physical address. If variable `meta_info' is set to 1 before the call, human-readable information on where the data was found will be written to `char*' variables `meta_vtoh_info' and `meta_ptoh_info'. Collecting this information is time consuming, so set `meta_info' to zero when running the simulator normally. (The meta functions are used by the simulator for instrumentation and error checking.)
P_Page_Entry ppe;  /* Physical page table entry. */
MetaVCopy(ppe, ppage_table[p_page]);
if( ppe.info.fresh_at <= GLOBALTIME )
  safe++;
else
  poison++;
Proteus version L3.11 supports non-blocking writes. When appropriate, store (write) instructions complete in the time needed for an integer instruction, even if they miss. The memory system completes the writes in program order as the following instructions execute. The execution of a load (read) following a store depends upon the memory model and whether the needed data is being written. If the needed data is currently being written (by the same processor), the load suffers only an instruction delay. If the data is not being written and ordered reads have been selected, the load must wait for all previous stores to complete before it can complete. (Data is "currently being written" with respect to a load if the data's store instruction preceded the load in program order, but the store itself has not yet completed.) If ordered reads is not selected, loads can start as soon as they are encountered.
Currently (L3.11), the memory model is sequentially consistent if `ORDER_READS' is defined. The memory model is not sequentially consistent if `ORDER_READS' is undefined and writes are nonblocking (by setting `SM_COMPL_BUFFER_SIZE' to a positive value).
The barrier and semaphore routines and all other Proteus system and library code work with or without ordered reads.
A function, `Shared_Memory_Wait', is provided that blocks until all pending accesses complete.
Nonblocking writes are controlled by macros defined in `mem.param', such as `SM_COMPL_BUFFER_SIZE' and `ORDER_READS'.
Many changes were made in how Proteus simulates shared memory; these changes affect simulator performance but, with one exception, have only a small effect on user code. This section describes these changes; it should be of interest to advanced users and those modifying the simulator. The processing of shared-memory pointer dereferences described here is the exception; beginning users should know that special shared-memory pointer dereference operators (e.g., `@apointer') are no longer needed; use the C operators (e.g., `*apointer') instead.
Before version L3.3, memory access functions, e.g., `Shared_Memory_Write', would dispatch a request to the cache controller to check the cache every simulated cycle. In the case of a cache hit, the cache would be checked only once. In the case of a miss, the cache would be checked many times.
A cache check would return a value indicating the status of the operation, e.g., `MEM_READY' (finished), `MEM_WAIT' (maybe ready soon), and `MEM_SWITCH' (ready later). The user could provide handlers for each of these return values, perhaps performing a task switch (if possible) for a `MEM_SWITCH' value.
Starting with version L3.3, the cache is only rechecked after it changes. A counter is kept at each processor's cache; it is incremented whenever a memory-to-cache protocol message is processed. A cache-check simulator request, ultimately calling `processor_request_pre', returns the value of this counter. The memory access function repeats the cache check only if the counter value is different than the value found at the last cache check. Otherwise a resume request is prepared and stored, and a context switch to the simulator thread is made. The resume request is enqueued when the next protocol message arrives at the cache. (For further details, see `shmem.calls.c' and `cache.net.c'.) Thus, the handlers are called once per cache change, rather than once per cycle.
Prior to version L3.3, "livelock" could occur when protocol messages were issued in a repeating pattern such that no memory access would complete. (Not to be confused with network livelock, which can't happen using Proteus' non-adaptively routed networks.) To prevent livelock in L3.3, memory accesses that fail a pre-determined number of times are re-tried after backing off (waiting) a randomly chosen amount of time. The backoff time is randomly chosen over a uniform distribution; the distribution increases with the number of failures. Starting with L3.8, the backoff is only used after every four tries. For example, suppose the backoff is 100 cycles. Then the next three cache changes will be examined without delay. After the third the thread will wait 100 cycles before examining the cache again.
Backoff delays are applied in the following way. A memory access will be tried without delay `LIVELOCK_WARNING_COUNT' times, at which time a warning is issued and a delay chosen. The delay, backoff, is chosen as follows. Initially, backoff is set to constant `INITIAL_BACKOFF'; at each adjustment time it is increased as follows:

backoff = backoff << 2;
if( backoff > LIVELOCK_BACKOFF_LIMIT ) backoff = LIVELOCK_BACKOFF_LIMIT;
backoff = 1 + random() % backoff;

where `LIVELOCK_BACKOFF_LIMIT' bounds the backoff delay. Thus the backoff is exponential, with the random delay separating contenders. The thread waits backoff time if the bits in the binary representation of the number of tries, try_count, selected by mask `BACKOFF_TRIES', are all zero. That is,

if( !(try_count & BACKOFF_TRIES) ) AddTime(backoff);

In L3.8, with `BACKOFF_TRIES' = 0x3, that works out to every four tries. The backoff delay is adjusted every `LIVELOCK_BACKOFF_INTERVAL' tries using the method described above. After `FATAL_LIVELOCK' tries, a fatal error is generated. Each of these quantities is set by defining constants in file `shmem.calls.c'.
The simulator uses a significant amount of host address space to hold cache directories (as a real system would). To reduce the amount of address space needed, only active blocks use storage for cache directories. Blocks obtain and return storage from a pool of directory block structures. The pool initially contains storage for enough blocks to hold two processors' cache lines; additional storage is allocated in large blocks as needed.
Before L3.8, shared memory calls, such as `Shared_Memory_Read', could be inserted by hand in source files or would be inserted by catoc where shared-memory dereference operators, `@' and `@>', were found. The catoc program reads the user's (compiler pre-processed) source (`.ca') file and writes a file (`.i') in which the operators are replaced by the appropriate function calls. These files, now including the shared-memory calls, would then be compiled into assembler code. The assembler code would be processed by the augment program, where cycle-counting code is inserted, library function calls are replaced by their cycle-counted versions, and loads of the dummy variable `current_time' are replaced by real time-reading code.
In version L3.8 ordinary shared-memory dereference operators, e.g., `*a', `b->c', and `d[e]', can (and should) be used in place of the old operators, `@a', `b@>c', and `@&d[e]'. The catoc program will pass these operators, unmodified, to the compiler, which generates regular load and store instructions from them. The shared-memory calls implied by these operators will be inserted by augment.
A load or store is assumed to access private memory if it is relative to the stack or frame or if it uses a constant address (constant to the assembler) or named offset. All other loads and stores are considered possible shared-memory accesses. Such instructions are replaced by sequences that test whether the address is in shared memory and, if so, call the appropriate function.
Shared memory addresses have their high bit set, so any load or store instruction not replaced by catoc that tried to access shared memory will cause a segmentation fault. (An alternate implementation using a segmentation fault handler took about four times longer to run.)
A new version of catoc can be instructed to treat shared memory dereference operators as ordinary operators by using the command-line option `-noat', used by default in `SimMake'. The option is intended for processing already-modified code.
The inserted code for shared-memory access was written to be fast, so that only those registers that need to be saved are saved. (That is, any register that is used by the procedure and is volatile across function calls; no attempt is made to detect live registers.) Code size does suffer, however.
The compiler doesn't see the shared-memory function calls and so generates higher-quality code, for several reasons. The compiler normally saves volatile registers around function calls; these saves and restores would be counted if shared-memory function calls were present, even though they would not occur on a true shared-memory system. Code movement across function calls is restricted, so shared-memory function calls limit compiler optimizations. The presence of function calls instead of memory accesses further limits compiler optimizations.
In exact and modeled network systems using caches, cache and memory operations are either serial or pipelined. For serial operations the device (cache or memory controller) performs one operation at a time, without overlap. For pipelined operations the device can perform several operations at once; the number is determined by the interfaces to the device. (The device can start operations as fast as the network interface can deliver request packets, in the case of the cache and memory controllers, and as fast as the processor can issue requests, in the case of the cache controller.) In either case the amount of time taken to complete a particular operation is `CACHE_ACCESS_LATENCY' for caches and `MEM_ACCESS_LATENCY' for memories. (This includes only one cache or memory operation. A shared-memory access might include several cache and memory operations plus messages between processors.)
Serial operations are simulated by defining `CORRECT_QUEUES' in `cache.param'. Pipelined operations are simulated if `CORRECT_QUEUES' is undefined. Before Version 3.6 all operations were pipelined.
When operations are serial, requests can spend a considerable amount of time waiting at a cache or memory device to be serviced. This waiting time is written into the events file and can be displayed using stats. Two graphs are used: one shows waiting time at caches, the other shows waiting time at memories; both show waiting time for each processor versus time. This waiting time does not include the time spent in the network interface.
Several shared memory access modes have been added.
A new mode, exclusive read, is available for shared-memory access. An exclusive read has the same effect on a memory location (block) as a write: copies of the block in other caches are invalidated, and the line's state is set to `READWRITE'. As with other reads, it returns a value and does not modify the accessed location.
An exclusive read is performed by including `EX_READ' in the mode of a call to procedure `Shared_Memory_Read'. The return value is not affected by the mode.
An exclusive read might be used just before a write to the same memory location. Consider:
a = *b;
c = f1(a) / 1.1;
*b = c - d;
If `*b' were not cached then the code above would result in at least four protocol messages: two to read `*b' in the first line, and at least two to get an exclusive copy for the write. Below, the read to `b' is exclusive:
a = Shared_Memory_Read(b, EX_READ | WORD);
c = f1(a) / 1.1;
*b = c - d;
In this code, the read to `b' would require two or more protocol messages, but the write to `b' will result in a cache hit (unless the line holding `b' were replaced or invalidated). Thus fewer time-consuming protocol messages are required.
Starting with L3.8, unsigned accesses are supported. In signed accesses the memory system sign-extends a quantity loaded from memory into a register larger than the item read. For example, a `-1' stored in a variable declared char, when loaded into a register (which is the same size as an `int'), will be changed from 8 ones (0x000000FF) to 32 ones (0xFFFFFFFF, on a 32-bit system). An unsigned access loads the number unchanged.
Unsigned accesses are specified by placing `UNSIGNED' in the mode argument of shared memory calls. The correct access mode is used when shared memory operations are automatically inserted (by augment).
For example,
int v1, v2, v3, v4;
unsigned char *volatile my_uchar;
char *volatile my_char;

my_uchar = (unsigned char*) (&our_mem[CURR_PROCESSOR]);
my_char  = (char*) (&our_mem[CURR_PROCESSOR]);

*my_uchar = 255;  /* Write 255 to shared memory. */
v1 = *my_uchar;
v2 = *my_char;
v3 = Shared_Memory_Read(my_uchar, BYTE | UNSIGNED);
v4 = Shared_Memory_Read(my_uchar, BYTE);
printf(" Values are: *un char, %d; *char, %d; S_M_R(U), %d; S_M_R(), %d.\n",
       v1, v2, v3, v4);
results in output
Values are: *un char, 255; *char, -1; S_M_R(U), 255; S_M_R(), -1.
A timid write is similar to a conventional write, except that it can succeed or fail. If it succeeds, the write is completed as a normal write. If it fails, the write is not performed. A timid write fails if the memory block is in the busy state when the request, initiated by the timid write, arrives. (A memory block is in a busy state while a processor is getting a READWRITE (exclusive) copy of the block. A memory block is normally busy for only a short amount of time, while cached copies are being invalidated.) Thus, a timid write fails if some other write request(s) had been issued at about the same time.
A timid write is performed using the `Shared_Memory_Write' call with `TIMID' as part of the mode and a pointer to a `MemOpValue' structure as the value. The `MemOpValue' structure holds the value to be written, the previous value (if the write succeeds), and a field indicating whether the write succeeded.
Structure `MemOpValue' is used in timid writes.
The example below shows how a timid write is used to implement a test and test and set semaphore. The code is used to obtain a lock; at times many processors might be simultaneously executing the code. The lock is obtained by the processor that replaces a 1 at memory location `addr' with a 0. The read is executed until a 1 is found, at which time an attempt is made to write a 0 using a timid write. The timid write returns the previous value of `addr'. The while loop is exited if the write succeeds and a 1 is the previous value. Otherwise, the read is repeated.
The write might fail if some other processor replaced the 1 with a zero. There would be no point in completing the write in this case. (Because all cached copies have to be invalidated before performing a write, writes generate more network traffic than reads, so the write should be avoided if possible.) The code below would work without a timid write, however it would generate much more network traffic.
while(1) {
  MemOpValue mo;
  mo.new = 0;
  if( (v = (int)Shared_Memory_Read((void *)addr, WORD)) > 0 ) {
    Shared_Memory_Write((void *)addr, &mo, WORD | TIMID);
    if( mo.outcome == MO_Completed && mo.previous )
      break;
  }
  fail_count++;
}
This example is a shortened version of function `sem_P' in `sema.ttset.c'.
In L3.11, upgrade messages were added to the directory-based cache coherence protocol. Upgrade messages are sent by a memory module to a cache requesting an exclusive copy of a block if the block's directory indicates that the cache already has a non-exclusive copy. The upgrade message grants exclusive access but does not contain data. Previously a "WDATA" message would be sent to the cache, granting exclusive access and also containing data (which in these cases would rarely be needed), adding unnecessary communication volume. (See James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA highly scalable server," 24th Annual International Symposium on Computer Architecture, pp. 241-251, June 1997.)
One of three line-replacement policies can be chosen, the directory size can be set, and the size of protocol messages can be set.
A line replacement policy is used by the cache controller to determine which cache line to evict (in order to make space for a new line). In the memory system simulated by Proteus, lines are evicted when, in the case of reads, data arrives from the memory controller and, in the case of writes, when the protocol message granting exclusive access arrives from the memory controller.
Proteus can use one of three replacement policies: part-random, full-random, and deterministic. Part-random is the most efficient while full-random might be easier to implement. The deterministic policy was used in Proteus before version L3.5; this policy can result in uneven cache usage and address interference and so should be used with caution.
Under the part-random replacement policy the cache controller, when choosing a line to evict, will look for an invalid line. If such a line is found it will be evicted. Otherwise, a line to evict will be chosen at random.
Under the full-random replacement policy the cache controller will randomly choose a line to evict (using a uniform distribution).
Under the deterministic replacement policy the cache controller will choose a line to evict based upon the processor number and address of the new line. Let `p' denote the processor number and let `a' denote the address of the new line. Then line (p^a)%SET_SIZE would be replaced, where SET_SIZE is the associativity of the cache. (Expression p^a indicates a bitwise exclusive or of p and a; x%y indicates x modulo y.)
The policy used by Proteus is determined by the setting of the parameters described below. The parameters are defined in file `cache.param'. The full- and part-random policies can also be chosen using config.
The size of cache directories is user-settable using parameter `DIRECTORY_SIZEP'. Cache directories are maintained at memory modules; there is one for each block (line). (A block's directory stores the location of cached copies of the block.) Proteus before version L3.3 used a fixed directory size of 10.
Proteus uses two sizes for protocol messages, messages sent between caches and memory modules. The sizes are for messages with and without data. (An example of a protocol message without data is a request for a read-only copy of a block. An example of a protocol message with data is the memory module's response to the request.) The size of a protocol message without data is ProtocolSize bytes, where ProtocolSize is the value of the run-time initialized variable `ProtocolSize'. The size of a protocol message with data is the number of bytes in a block plus ProtocolSize.
Note. Before version L3.3, protocol message size was always one flit if the software coherence (CACHEWITHDATA) option was turned off.
Starting with L3.5 Proteus uses a different method for reading lines which are exclusive (READWRITE) in another cache. Before L3.5, such a read would result in the exclusive copy being invalidated. Starting with L3.5 the exclusive copy would change to the shared (READONLY) state. This change cannot be switched off.
The V2 memory allocation routines allocate a block of shared memory which can span any number of processors. The address space used by user code to access the block is contiguous, that is, the memory addresses are consecutive, so that the usual pointer arithmetic can be performed. Because the blocks span several processors' memories, they can be much larger than they would be in Proteus 3. Addresses can be spread across modules in contiguous blocks, cyclically, or randomly. The time needed to allocate this memory is not fully simulated on non-VM systems, so it should not be used when timing-sensitive statistics are being collected.
To use V2 memory allocation in systems without virtual memory, the macro `V2_ON' should be defined in file `mem.param'; no special settings are needed when virtual memory is used. (Parameter `V2_ON' can be set from the `config' program, under the simulator group.)
Two other macros are used by V2 on systems without virtual memory, `V2_MaxPages' and `V2_LogPageSize'. Macro `V2_LogPageSize' indicates the minimum amount that V2 will allocate from a processor's memory, called a page. Macro `V2_MaxPages' indicates the maximum number of pages that can be allocated. The maximum amount of memory that can be allocated (in C syntax) is V2_MaxPages * (1<<V2_LogPageSize). V2 maintains a page table of V2_MaxPages entries, so a lot of memory is used when `V2_MaxPages' is small and `V2_LogPageSize' is large. Both `V2_MaxPages' and `V2_LogPageSize' can also be set using the config program.
On virtual memory systems V2 uses the VM page size. (See section Virtual Memory.)
No special header files need be included to use V2 memory; the macros, variable, and function declarations are loaded via `user.h'.
The functions, variables, and macros needed to use V2 memory are described below.
unsigned long bytes_needed;
int numProcs;
int *prefProc;
V2_Dist arrangement;
`OS_getLargeMem' and `OS_getLargeMemD' allocate a block of shared memory bytes_needed bytes long. The memory is divided as evenly as possible among processors prefProc[0], prefProc[1], ..., prefProc[numProcs-1]. Each element should be a valid processor number; however, they need not be distinct. (This might be done to allocate more of the memory on certain processors.) If all allocation attempts succeed then the memory will be allocated in one pass. A second pass is made if necessary (for example, if no memory can be allocated on prefProc[numProcs-1]). If the memory cannot be allocated by the end of the second pass then `OS_getLargeMem' will fail and return NULL. The partially allocated memory is not recovered. Upon successful allocation a V2 virtual address for the first byte in the block will be returned.
Function `OS_getLargeMem' divides the address space contiguously, that is, as many consecutive addresses as possible are placed on one processor. With `OS_getLargeMemD' the method of distribution can be selected using arrangement. If arrangement is `V2_Contiguous' then addresses are divided as in `OS_getLargeMem'. If arrangement is `V2_Distributed' then addresses are arranged cyclically, that is, the first page of addresses is placed on prefProc[0], the second on prefProc[1], and so on, returning to prefProc[0] and repeating the sequence as many times as necessary. If arrangement is `V2_Random' then pages will be arranged randomly.
A V2 address can be used as any other shared-memory address. Shared-memory access functions will automatically translate the V2 address back into the host machine virtual address (which Proteus normally uses). If the address is dereferenced by a non-Proteus function then a segmentation fault will result.
On systems using virtual memory the allocation and deallocation functions are fully simulated. On systems without virtual memory the functions use non-simulated memory and so will not give realistic results. They should not be called at locations in code where accurate timing is needed. On systems without virtual memory there is no corresponding free (deallocation) function.
Function `OS_getMemUniversal' chooses an allocation method based on the request size: if the requested size is at least v2_smallestBigBlockSize then the memory will be allocated over all processors using OS_getLargeMem. Otherwise it will be allocated on a single processor using OS_getmem.
Function `OS_getMemUniversalD' allocates memory as above, mapping addresses to modules as specified by arrangement. (See `OS_getLargeMem' above.)
Converts address from the V2 address space to the host-machine address space, or leaves address unchanged if it is not in the V2 address space. It is implemented as a macro for speed and as a function for use while debugging. On virtual memory systems the `metaVtoH' function performs the corresponding translation.
The following variables may be of use to user code.
The code for V2 is in files `shVmem.c' and `shVmem.h'. Test code can be found in file `fatProg.ca'.