Local changes include virtual memory, nonblocking writes, more efficient simulation of shared memory, automatic detection of shared-memory accesses, several additional access modes and cache configurations, and a memory-allocation system allowing flexible distribution of dynamically allocated memory over memory modules.
Proteus version L3.11 simulates virtual-memory systems. Simulated systems use a fully associative translation lookaside buffer (TLB) and a single-level page table. Caches are physically mapped; physical pages can be selected using bin-hopping or color-matching schemes. The selection scheme, page size, and TLB size are user-settable. Currently, the virtual address space is the same size as the physical address space, so the advantage is greater realism in simulating TLB misses. (Paging to and from disk can easily be added, and probably will be in a future release.)
The TLB miss handler is implemented in software (running on the simulated system) so TLB miss delays include the time to access the page table, including cache and additional TLB misses.
Virtual pages can be moved between processors, even while they are being accessed. An attempted access to a page being moved triggers an interrupt handler, which ultimately updates the TLB with the new location (when available). Once the TLB is updated, the thread resumes and the access completes.
VM-related parameters are placed in the file `mem.param'; in addition, an RTI is used to specify page selection. Virtual memory is turned on by defining parameter `VM_ON'; the page size is set with `VM_PAGE_LOG_SIZE'. The number of TLB entries can be specified absolutely or relatively (relative to the minimum number needed).
Proteus L3.11 uses directory poisoning to change virtual-to-physical mappings (to move pages between processors or to return virtual address space to the OS); as part of this scheme, TLB entries have a limited lifetime. (See James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA highly scalable server," 24th Annual International Symposium on Computer Architecture, pp. 241-251, June 1997.) Parameter `TLB_SOJ_LIMIT' specifies the lifetime.
Lower settings increase the TLB miss rate, which can have a substantial negative impact on performance. Higher values can increase the amount of time it takes to allocate new memory after all free pages have been previously mapped, a concern for programs that migrate a large number of pages or programs that deallocate and reallocate large amounts of memory.
For example, consider the definitions below, placed in `mem.param'.
#define TLB_SIZE_ABSOLUTE
#define TLB_SIZE          200
#define TLB_SOJ_LIMIT     100000
#define VM_PAGE_LOG_SIZE  12
#define VM_ON
These specify a system using virtual memory with 4096-byte pages and 200-entry TLBs. The 100,000-cycle TLB sojourn limit had a negligible impact on all programs tested.
When virtual address space is allocated (using `vm_sbrk', usually called by library routines) page table entries are initialized for each virtual page, but physical pages are not yet chosen. A physical page is chosen for a virtual page when the memory is first accessed (and later if the page is moved).
There are two characteristics of a physical page that affect performance: its location and its color. The location of a page is the memory module at which it is stored. (See section Page Relocation.) A page's color is the value of a range of bits in the addresses that access it; the range is the intersection of the page-number bits (as defined by the virtual memory system) and the set-number bits (as defined by the cache configuration). If the intersection is empty there is one color; if the intersection contains one bit there are two colors; and so on. For example, consider a system with a 4096-byte page size, and a cache with 8-byte lines and 8192 sets. Bits 12 through 31 (inclusive) of an address give the page number, and bits 3 through 14 (inclusive) give the set number; the page color is the value of bits 12 through 14.
Since caches are physically addressed, page color can have an impact on cache performance. If every page selected had the same color, only a fraction of the cache would be used (one over the number of possible page colors). If pages were selected arbitrarily, the tedious and time-consuming data-layout efforts of a conscientious programmer would be undermined.
Two systems are available in L3.11 for choosing the color of physical pages: bin hopping and color matching. In color matching, specified by setting RTI `vm_bin_hop' to 0, the OS will first look for a physical page of the same color as the virtual page. With bin hopping, specified by setting `vm_bin_hop' to 1, the OS uses a processor-local "next-color" counter. The counter is incremented (or set to zero) each time a page is allocated. (See Richard E. Kessler and Mark D. Hill, "Page Placement Algorithms for large real-indexed caches," ACM Transactions on Computer Systems, vol. 10, no. 4, November 1992, pp. 338-359.)
When a request is made to allocate a page (using procedure `vm_get_page' in os_vm.ca, which is ordinarily called by library routines) a page at the desired location and of the desired color is returned, if such a page is available. If not, an alternate page is chosen.
The steps below are followed until a page is found.
In the sample RTI file shown below (`prot.par' by default), a bin-hopping page selection scheme is specified.
vm_bin_hop 1
Pages can be moved between processors at any time, even while being accessed. (If pages are to remain in place, it is most efficient to "move" a page to its location before accessing it, otherwise data would have to be copied.)
Pages are moved using function `vm_set_home'. In addition, there are dynamic memory allocation routines that set the page locations automatically, in a variety of patterns. (See section V2 Memory Allocation.)
The following functions and macros are intended for user code. Simulated code might access the page size so that it can properly block data. Debugging and instrumentation code can use the metaFoo functions to access, but not disturb, simulated memory.
The macros and functions below are in file vm.h, except where noted otherwise.
Function `metaVtoH' converts virtual address addr to a host address, `metaVtoHT' converts virtual address addr to a host address of the tag (if any), and `metaVtoP' converts the address to a physical address. If variable `meta_info' is set to 1 before the call, human-readable information on where the data was found will be written to `char*' variables `meta_vtoh_info' and `meta_ptoh_info'. Collecting this information is time consuming, so set `meta_info' to zero when running the simulator normally. (The meta functions are used by the simulator for instrumentation and error checking.)
P_Page_Entry ppe;  /* Physical page table entry. */
MetaVCopy(ppe, ppage_table[p_page]);
if( ppe.info.fresh_at <= GLOBALTIME )
  safe++;
else
  poison++;
Proteus version L3.11 supports non-blocking writes. When appropriate, store (write) instructions complete in the time needed for an integer instruction, even if they miss. The memory system completes the writes in program order as the following instructions execute. The execution of a load (read) following a store depends upon the memory model and whether the needed data is being written. If the needed data is currently being written (by the same processor), the load suffers only an instruction delay. If the data is not being written and ordered reads have been selected, the load must wait for all previous stores to complete before it can complete. (Data is "currently being written" with respect to a load if the data's store instruction preceded the load in program order, but the store itself has not yet completed.) If ordered reads is not selected, loads can start as soon as they are encountered.
Currently (L3.11), the memory model is sequentially consistent if `ORDER_READS' is defined. The memory model is not sequentially consistent if `ORDER_READS' is undefined and writes are nonblocking (by setting `SM_COMPL_BUFFER_SIZE' to a positive value).
The barrier and semaphore routines and all other Proteus system and library code work with or without ordered reads.
A function, `Shared_Memory_Wait', is provided that blocks until all pending accesses complete.
Nonblocking writes are controlled by macros defined in `mem.param', such as `SM_COMPL_BUFFER_SIZE' and `ORDER_READS'.
Many changes were made in how Proteus simulates shared memory; these changes affect simulator performance but, with one exception, have only a small effect on user code. This section describes these changes; it should be of interest to advanced users and those modifying the simulator. The processing of shared-memory pointer dereferences described here is the exception; beginning users should know that special shared-memory pointer dereference operators (e.g., `@apointer') are no longer needed; use the C operators (e.g., `*apointer') instead.
Before version L3.3, memory access functions, e.g., `Shared_Memory_Write', would dispatch a request to the cache controller to check the cache every simulated cycle. In the case of a cache hit, the cache would be checked only once. In the case of a miss, the cache would be checked many times.
A cache check would return a value indicating the status of the operation, e.g., `MEM_READY' (finished), `MEM_WAIT' (maybe ready soon), and `MEM_SWITCH' (ready later). The user could provide handlers for each of these return values, perhaps performing a task switch (if possible) for a `MEM_SWITCH' value.
Starting with version L3.3, the cache is only rechecked after it changes. A counter is kept at each processor's cache; it is incremented whenever a memory-to-cache protocol message is processed. A cache-check simulator request, ultimately calling `processor_request_pre', returns the value of this counter. The memory access function repeats the cache check only if the counter value is different than the value found at the last cache check. Otherwise a resume request is prepared and stored, and a context switch to the simulator thread is made. The resume request is enqueued when the next protocol message arrives at the cache. (For further details, see `shmem.calls.c' and `cache.net.c'.) Thus, the handlers are called once per cache change, rather than once per cycle.
Prior to version L3.3, "livelock" could occur when protocol messages were issued in a repeating pattern such that no memory access would complete. (Not to be confused with network livelock, which can't happen using Proteus' non-adaptively routed networks.) To prevent livelock in L3.3, memory accesses that fail a pre-determined number of times are re-tried after backing off (waiting) a randomly chosen amount of time. The backoff time is randomly chosen over a uniform distribution; the distribution increases with the number of failures. Starting with L3.8, the backoff is only used after every four tries. For example, suppose the backoff is 100 cycles. Then the next three cache changes will be examined without delay. After the third the thread will wait 100 cycles before examining the cache again.
Backoff delays are applied in the following way. A memory access will be tried without delay `LIVELOCK_WARNING_COUNT' times, at which time a warning is issued and a delay chosen. The delay, backoff, is chosen as follows. Initially, backoff is set to constant `INITIAL_BACKOFF'; at each adjustment time it is increased as follows:

backoff = backoff << 2;
if( backoff > LIVELOCK_BACKOFF_LIMIT ) backoff = LIVELOCK_BACKOFF_LIMIT;
backoff = 1 + random() % backoff;

where `LIVELOCK_BACKOFF_LIMIT' bounds the backoff delay. Thus the backoff is exponential, with the random delay separating contenders. The thread waits backoff time if the bits in the binary representation of the number of tries, try_count, selected by mask `BACKOFF_TRIES', are all zero. That is,

if( !(try_count & BACKOFF_TRIES) ) AddTime(backoff);

In L3.8, with `BACKOFF_TRIES' = 0x3, that works out to every four tries. The backoff delay is adjusted every `LIVELOCK_BACKOFF_INTERVAL' tries using the method described above. After `FATAL_LIVELOCK' tries, a fatal error is generated. Each of these quantities is set by defining constants in file `shmem.calls.c'.
The simulator uses a significant amount of host address space to hold cache directories (as a real system would). To reduce the amount of address space needed, only active blocks use storage for cache directories. Blocks obtain and return storage from a pool of directory block structures. The pool initially contains storage for enough blocks to hold two processors' cache lines; additional storage is allocated in large blocks as needed.
Before L3.8, shared memory calls, such as `Shared_Memory_Read', could be inserted by hand in source files or would be inserted by catoc where shared-memory dereference operators, `@' and `@>', were found. The catoc program reads the user's (compiler pre-processed) source (`.ca') file and writes a file (`.i') in which the operators are replaced by the appropriate function calls. These files, now including the shared-memory calls, would then be compiled into assembler code. The assembler code would be processed by the augment program, where cycle-counting code is inserted, library function calls are replaced by their cycle-counted versions, and loads of the dummy variable `current_time' are replaced by real time-reading code.
In version L3.8 ordinary shared-memory dereference operators, e.g., `*a', `b->c', and `d[e]', can (and should) be used in place of the old operators, `@a', `b@>c', and `@&d[e]'. The catoc program will pass these operators, unmodified, to the compiler, which generates regular load and store instructions from them. The shared-memory calls implied by these operators will be inserted by augment.
A load or store is assumed to access private memory if it is relative to the stack or frame or if it uses a constant address (constant to the assembler) or named offset. All other loads and stores are considered possible shared-memory accesses. Such instructions are replaced by sequences that test whether the address is in shared memory and, if so, call the appropriate function.
Shared memory addresses have their high bit set, so any load or store instruction not replaced by catoc that tried to access shared memory will cause a segmentation fault. (An alternate implementation using a segmentation fault handler took about four times longer to run.)
A new version of catoc can be instructed to treat shared memory dereference operators as ordinary operators by using the command-line option `-noat', used by default in `SimMake'. The option is intended for processing already-modified code.
The inserted code for shared-memory access was written to be fast, so that only those registers that need to be saved are saved. (That is, any register that is used by the procedure and is volatile across function calls; no attempt is made to detect live registers.) Code size does suffer, however.
The compiler doesn't see the shared-memory function calls and so generates higher-quality code, for several reasons. The compiler normally saves volatile registers around function calls; these saves and restores would be counted if shared-memory function calls were present, even though they would not occur on a true shared-memory system. Code movement across function calls is restricted, so shared-memory function calls limit compiler optimizations. The presence of function calls instead of memory accesses further limits compiler optimizations.
In exact and modeled network systems using caches, cache and memory operations are either serial or pipelined. For serial operations the device (cache or memory controller) performs one operation at a time, without overlap. For pipelined operations the device can perform several operations at once; the number is determined by the interfaces to the device. (The device can start operations as fast as the network interface can deliver request packets, in the case of the cache and memory controllers, and as fast as the processor can issue requests, in the case of the cache controller.) In either case the amount of time taken to complete a particular operation is `CACHE_ACCESS_LATENCY' for caches and `MEM_ACCESS_LATENCY' for memories. (This includes only one cache or memory operation. A shared-memory access might include several cache and memory operations plus messages between processors.)
Serial operations are simulated by defining `CORRECT_QUEUES' in `cache.param'. Pipelined operations are simulated if `CORRECT_QUEUES' is undefined. Before Version 3.6 all operations were pipelined.
When operations are serial, requests can spend a considerable amount of time waiting at a cache or memory device to be serviced. This waiting time is written into the events file and can be displayed using stats. Two graphs are used: one shows waiting time at caches, the other shows waiting time at memories; both show waiting time for each processor versus time. This waiting time does not include the time spent in the network interface.
Several shared memory access modes have been added.
A new mode, exclusive read, is available for shared-memory access. An exclusive read has the same effect on a memory location (block) as a write: copies of the block in other caches are invalidated, and the line's state is set to `READWRITE'. As with other reads, it returns a value and does not modify the accessed location.
An exclusive read is performed by including `EX_READ' in the mode of a call to procedure `Shared_Memory_Read'. The return value is not affected by the mode.
An exclusive read might be used just before a write to the same memory location. Consider:
a = *b;
c = f1(a) / 1.1;
*b = c - d;
If `*b' were not cached then the code above would result in at least four protocol messages: two to read `*b' in the first line, and at least two to get an exclusive copy for the write. Below, the read to `b' is exclusive:
a = Shared_Memory_Read(b, EX_READ | WORD);
c = f1(a) / 1.1;
*b = c - d;
In this code, the read to `b' would require two or more protocol messages, but the write to `b' will result in a cache hit (unless the line holding `b' were replaced or invalidated). Thus fewer time-consuming protocol messages are required.
Starting with L3.8, unsigned accesses are supported. In signed accesses the memory system sign-extends a quantity loaded from memory into a register larger than the item read. For example, a `-1' stored in a variable declared char, when loaded into a register (which is the same size as an `int'), will be changed from 8 ones (0x000000FF) to 32 ones (0xFFFFFFFF, on a 32-bit system). An unsigned access loads the number unchanged.
Unsigned accesses are specified by placing `UNSIGNED' in the mode argument of shared memory calls. The correct access mode is used when shared memory operations are automatically inserted (by augment).
For example,
int v1, v2, v3, v4;
unsigned char *volatile my_uchar;
char *volatile my_char;

my_uchar = (unsigned char*) (&our_mem[CURR_PROCESSOR]);
my_char  = (char*) (&our_mem[CURR_PROCESSOR]);

*my_uchar = 255;  /* Write 255 to shared memory. */
v1 = *my_uchar;
v2 = *my_char;
v3 = Shared_Memory_Read(my_uchar, BYTE | UNSIGNED);
v4 = Shared_Memory_Read(my_uchar, BYTE);
printf(" Values are: *un char, %d; *char, %d; S_M_R(U), %d; S_M_R(), %d.\n",
       v1, v2, v3, v4);
results in output
Values are: *un char, 255; *char, -1; S_M_R(U), 255; S_M_R(), -1.
A timid write is similar to a conventional write, except that it can succeed or fail. If it succeeds, the write is completed as a normal write. If it fails, the write is not performed. A timid write fails if the memory block is in the busy state when the request, initiated by the timid write, arrives. (A memory block is in a busy state while a processor is getting a READWRITE (exclusive) copy of the block. A memory block is normally busy for only a short amount of time, while cached copies are being invalidated.) Thus, a timid write fails if some other write request(s) had been issued at about the same time.
A timid write is performed using the `Shared_Memory_Write' call with `TIMID' as part of the mode and a pointer to a `MemOpValue' structure as the value. The `MemOpValue' structure holds the value to be written, the previous value (if the write succeeds), and a field indicating whether the write succeeded.
Structure `MemOpValue' is used in timid writes.
The example below shows how a timid write is used to implement a test and test and set semaphore. The code is used to obtain a lock; at times many processors might be simultaneously executing the code. The lock is obtained by the processor that replaces a 1 at memory location `addr' with a 0. The read is executed until a 1 is found, at which time an attempt is made to write a 0 using a timid write. The timid write returns the previous value of `addr'. The while loop is exited if the write succeeds and a 1 is the previous value. Otherwise, the read is repeated.
The write might fail if some other processor replaced the 1 with a zero. There would be no point in completing the write in this case. (Because all cached copies have to be invalidated before performing a write, writes generate more network traffic than reads, so the write should be avoided if possible.) The code below would work without a timid write, however it would generate much more network traffic.
while(1) {
  MemOpValue mo;
  mo.new = 0;
  if( (v = (int)Shared_Memory_Read((void *)addr, WORD)) > 0 ) {
    Shared_Memory_Write((void *)addr, &mo, WORD | TIMID);
    if( mo.outcome == MO_Completed && mo.previous )
      break;
  }
  fail_count++;
}
This example is a shortened version of function `sem_P' in `sema.ttset.c'.
In L3.11, upgrade messages were added to the directory-based cache coherence protocol. Upgrade messages are sent by a memory module to a cache requesting an exclusive copy of a block if the block's directory indicates that the cache already has a non-exclusive copy. The upgrade message grants exclusive access but does not contain data. Previously a "WDATA" message would be sent to the cache, granting exclusive access and also containing data (which in these cases would rarely be needed), adding unnecessary communication volume. (See James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA highly scalable server," 24th Annual International Symposium on Computer Architecture, pp. 241-251, June 1997.)
One of three line-replacement policies can be chosen, the directory size can be set, and the size of protocol messages can be set.
A line replacement policy is used by the cache controller to determine which cache line to evict (in order to make space for a new line). In the memory system simulated by Proteus, lines are evicted when, in the case of reads, data arrives from the memory controller and, in the case of writes, when the protocol message granting exclusive access arrives from the memory controller.
Proteus can use one of three replacement policies: part-random, full-random, and deterministic. Part-random is the most efficient while full-random might be easier to implement. The deterministic policy was used in Proteus before version L3.5; this policy can result in uneven cache usage and address interference and so should be used with caution.
Under the part-random replacement policy the cache controller, when choosing a line to evict, will look for an invalid line. If such a line is found it will be evicted. Otherwise, a line to evict will be chosen at random.
Under the full-random replacement policy the cache controller will randomly choose a line to evict (using a uniform distribution).
Under the deterministic replacement policy the cache controller will choose a line to evict based upon the processor number and address of the new line. Let `p' denote the processor number and let `a' denote the address of the new line. Then line (p^a)%SET_SIZE would be replaced, where SET_SIZE is the associativity of the cache. (Expression p^a indicates a bitwise exclusive or of p and a; x%y indicates x modulo y.)
The policy used by Proteus is determined by the setting of the parameters described below. The parameters are defined in file `cache.param'. The full- and part-random policies can also be chosen using config.
The size of cache directories is user-settable using parameter `DIRECTORY_SIZEP'. Cache directories are maintained at memory modules; there is one for each block (line). (A block's directory stores the location of cached copies of the block.) Proteus before version L3.3 used a fixed directory size of 10.
Proteus uses two sizes for protocol messages, messages sent between caches and memory modules. The sizes are for messages with and without data. (An example of a protocol message without data is a request for a read-only copy of a block. An example of a protocol message with data is the memory module's response to the request.) The size of a protocol message without data is ProtocolSize bytes, where ProtocolSize is the value of the run-time initialized variable `ProtocolSize'. The size of a protocol message with data is the number of bytes in a block plus ProtocolSize.
Note. Before version L3.3, protocol message size was always one flit if the software coherence (CACHEWITHDATA) option was turned off.
Starting with L3.5 Proteus uses a different method for reading lines which are exclusive (READWRITE) in another cache. Before L3.5, such a read would result in the exclusive copy being invalidated. Starting with L3.5 the exclusive copy would change to the shared (READONLY) state. This change cannot be switched off.
The V2 memory allocation routines allocate a block of shared memory which can span any number of processors. The address space used by user code to access the block is contiguous, that is, the memory addresses are consecutive, so that the usual pointer arithmetic can be performed. Because the blocks span several processors' memories, they can be much larger than they would be in Proteus 3. Addresses can be spread across modules in contiguous blocks, cyclically, or randomly. The time needed to allocate this memory is not fully simulated on non-VM systems, so it should not be used when timing-sensitive statistics are being collected.
To use V2 memory allocation in systems without virtual memory, the macro `V2_ON' should be defined in file `mem.param'; no special settings are needed when virtual memory is used. (Parameter `V2_ON' can be set from the `config' program, under the simulator group.)
Two other macros are used by V2 on systems without virtual memory, `V2_MaxPages' and `V2_LogPageSize'. Macro `V2_LogPageSize' indicates the minimum amount that V2 will allocate from a processor's memory, called a page. Macro `V2_MaxPages' indicates the maximum number of pages that can be allocated. The maximum amount of memory that can be allocated (in C syntax) is V2_MaxPages * (1<<V2_LogPageSize). V2 maintains a page table of V2_MaxPages entries, so a lot of memory is used when `V2_MaxPages' is small and `V2_LogPageSize' is large. Both `V2_MaxPages' and `V2_LogPageSize' can also be set using the config program.
On virtual memory systems V2 uses the VM page size. (See section Virtual Memory.)
No special header files need be included to use V2 memory; the macros, variable, and function declarations are loaded via `user.h'.
The functions, variables, and macros needed to use V2 memory are described below.
unsigned long bytes_needed;
int numProcs;
int *prefProc;
V2_Dist arrangement;
`OS_getLargeMem' and `OS_getLargeMemD' allocate a block of shared memory bytes_needed bytes long. The memory is divided as evenly as possible among processors prefProc[0], prefProc[1], ..., prefProc[numProcs-1]. Each element should be a valid processor number; however, they need not be distinct. (This might be done to allocate more of the memory on certain processors.) If all allocation attempts succeed then the memory will be allocated in one pass. A second pass is made if necessary (for example, if no memory can be allocated on prefProc[numProcs-1]). If the memory cannot be allocated by the end of the second pass then `OS_getLargeMem' will fail and return NULL. The partially allocated memory is not recovered. Upon successful allocation a V2 virtual address for the first byte in the block will be returned.
Function `OS_getLargeMem' divides the address space contiguously, that is, as many consecutive addresses as possible are placed on one processor. With `OS_getLargeMemD' the method of distribution can be selected using arrangement. If arrangement is `V2_Contiguous' then addresses are divided as in `OS_getLargeMem'. If arrangement is `V2_Distributed' then addresses are arranged cyclically, that is, the first page of addresses is placed on prefProc[0], the second on prefProc[1], and so on, returning to prefProc[0] and repeating the sequence as many times as necessary. If arrangement is `V2_Random' then pages will be arranged randomly.
A V2 address can be used as any other shared-memory address. Shared-memory access functions will automatically translate the V2 address back into the host machine virtual address (which Proteus normally uses). If the address is dereferenced by a non-Proteus function then a segmentation fault will result.
On systems using virtual memory the allocation and deallocation functions are fully simulated. On systems without virtual memory the functions use non-simulated memory and so will not give realistic results. They should not be called at locations in code where accurate timing is needed. On systems without virtual memory there is no corresponding free (deallocation) function.
Function `OS_getMemUniversal' chooses an allocation method based on the request size: if the requested size is at least v2_smallestBigBlockSize then the memory will be allocated over all processors using OS_getLargeMem. Otherwise it will be allocated on a single processor using OS_getmem.
Function `OS_getMemUniversalD' allocates memory as above, mapping addresses to modules as specified by arrangement. (See `OS_getLargeMem' above.)
Converts address from the V2 address space to the host-machine address space, or leaves address unchanged if it is not in the V2 address space. It is implemented as a macro for speed and as a function for use while debugging. On virtual memory systems the `metaVtoH' function performs the corresponding translation.
The following variables may be of use to user code.
The code for V2 is in files `shVmem.c' and `shVmem.h'. Test code can be found in file `fatProg.ca'.