Cache Coherence
To achieve reasonable performance on a shared memory multiprocessor (whether it is SMP or DSM), it is important that each processor caches a portion of the shared memory space. Use of caches relies on the locality principle to achieve better performance through:
- replication. The same data item may be stored in more than one cache.
- migration. A data item may move to a place where it is being used.
Cache coherence: Each processor must see a valid version of the address space. In particular,
- If processor P writes to memory location X, then reads from X, and there are no intervening writes to X by other processors, then P will read the same value that it wrote.
- If P writes to X, the value it writes will eventually be visible to all other processors.
- Writes to the same location are serialized. That is, they appear to occur in the same order to all processors.
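For example, if two concurrent writes are propagated by update messages that are not serialized, the two caches can end up holding different values for X:
| time | event | cache A | cache B | memory |
|---|---|---|---|---|
| 0 | 0 | | | |
| 1 | A writes 1 to X; B writes 2 to X | 1 | 2 | 1 |
| 2 | A tells other processors to update their caches with (X,1); B tells other processors to update their caches with (X,2) | 2 | 1 | 1 |
| 3 | A reads X, gets 2; B reads X, gets 1 | 2 | 1 | 1 |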
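The effect in the table can be reproduced with a few lines of Python. This is only an illustrative sketch: `apply_updates` and the tuple encoding of a write are made-up names, and each cache is modeled as a single value for X.

```python
def apply_updates(initial, updates):
    """Return the value a cache holds for X after applying updates in order."""
    value = initial
    for _writer, new_value in updates:
        value = new_value
    return value

# Two concurrent writes to X: A writes 1, B writes 2.
write_a = ("A", 1)
write_b = ("B", 2)

# Not serialized: each cache applies its own write first,
# then the update message it receives from the other processor.
cache_a = apply_updates(0, [write_a, write_b])
cache_b = apply_updates(0, [write_b, write_a])
unserialized = (cache_a, cache_b)

# Serialized: both caches observe the writes in one agreed order.
order = [write_a, write_b]
serialized = (apply_updates(0, order), apply_updates(0, order))

print(unserialized)  # (2, 1): the caches disagree, as in the table
print(serialized)    # (2, 2): both caches agree
```

With a single agreed order, it does not matter which write comes last; what matters is that every cache sees the same last write.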
A related issue is cache consistency: When a value is written by one processor, when will it become visible to other processors? There are several ways to answer this question.
How is cache coherence enforced? The processors follow a protocol in accessing the cache and main memory in order to enforce cache coherence. The protocol specifies the actions to be taken on a read miss, write miss, etc. Cache coherence protocols on current multiprocessors are implemented in hardware, using one of two general approaches:
- snooping. Used on bus-based multiprocessors. Each processor "snoops" on the bus so that it is aware of memory operations performed by other processors. It can use that information to determine when something it is caching is read or written by another processor and take an appropriate action.
- directory-based. Used on multiprocessors without a shared bus. A directory is used to keep track of the status of each cached block.
These protocols use two ways to determine what to do on a write operation:
- write invalidate
  - most common
  - when P writes a value to X, other copies of X are invalidated
- write update
  - when P writes a value to X, the same value is written to all other copies
- Multiple writes to the same word or block require multiple update broadcasts in a write update protocol, but only one invalidate operation in a write invalidate protocol.
- A write to X by P followed by a read from X by Q is faster if write update is used, since Q's copy is already up to date and Q does not miss.
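To make the first trade-off concrete, here is a rough count of coherence messages for a run of consecutive writes by one processor to a block that other caches also hold. The function name `coherence_messages` is hypothetical, and the model assumes no other processor touches the block between the writes.

```python
def coherence_messages(num_writes, protocol):
    """Count bus messages caused by consecutive writes from one processor.

    Assumes other caches hold the block at the start and nobody else
    reads or writes it in between (a simplification for illustration).
    """
    if num_writes < 0:
        raise ValueError("num_writes must be non-negative")
    if protocol == "update":
        # Every write is broadcast to the other copies.
        return num_writes
    if protocol == "invalidate":
        # The first write invalidates the other copies; later writes
        # hit the now-exclusive copy and need no bus traffic.
        return 1 if num_writes > 0 else 0
    raise ValueError(f"unknown protocol: {protocol}")

print(coherence_messages(5, "update"))      # 5
print(coherence_messages(5, "invalidate"))  # 1
```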
A bus-based cache coherence protocol
A common protocol implemented on Intel processors is MESI, so named because each cache block can be in one of four states:
- M (modified). The block is cached only on this processor. It has been modified, so it may not agree with the version in main memory.
- E (exclusive). The block is cached only on this processor. It is unmodified.
- S (shared). The block is also cached on another processor or processors. It is unmodified.
- I (invalid).
Hennessy and Patterson present a simplified form of the protocol using only three states. (M and E are combined into a single E state.)
Assume:
- Individual processors use write back caches.
- A processor must have an exclusive copy of a block in order to modify it. That is, the protocol uses a "write invalidate" strategy.
- The bus supports three transaction types:
  - read
  - read and invalidate (write miss)
  - write back

To summarize the actions taken:
| current state | event | bus action | new state |
|---|---|---|---|
| invalid | read miss | read | shared |
| invalid | write miss | read and invalidate | exclusive |
| exclusive | read hit | none | exclusive |
| exclusive | write hit | none | exclusive |
| exclusive | read miss | write back followed by read | shared |
| exclusive | write miss | write back followed by read and invalidate | exclusive |
| shared | read hit | none | shared |
| shared | read miss | read | shared |
| shared | write hit | read and invalidate | exclusive |
| shared | write miss | read and invalidate | exclusive |
| exclusive | read on bus | respond to read request by putting data on bus, abort memory response | shared |
| exclusive | read and invalidate on bus | write back, abort memory response | invalid |
| shared | read on bus | none | shared |
| shared | read and invalidate on bus | none | invalid |
Note: In this version, "shared" really means "sharable". The cache block agrees with the contents of main memory. There may or may not be additional copies in other caches. The "exclusive" state really means "modified".
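The table above can be exercised with a small simulation. The sketch below is an assumption-laden toy, not a hardware description: the class name `SnoopingSystem` is invented, it tracks only one block, and it ignores replacement misses (a miss that evicts a different block), so only the invalid-state misses, the hits, and the snooped bus events from the table appear.

```python
class SnoopingSystem:
    """Three-state write-invalidate protocol for ONE block in several caches."""

    def __init__(self, num_caches):
        self.state = ["invalid"] * num_caches
        self.bus_log = []  # (cache index, action) in bus order

    def _snoop(self, requester, transaction):
        # Every other cache reacts to the transaction it sees on the bus.
        for i, s in enumerate(self.state):
            if i == requester:
                continue
            if transaction == "read" and s == "exclusive":
                # Owner supplies the data; the memory response is aborted.
                self.bus_log.append((i, "supply data"))
                self.state[i] = "shared"
            elif transaction == "read and invalidate" and s != "invalid":
                if s == "exclusive":
                    self.bus_log.append((i, "write back"))
                self.state[i] = "invalid"

    def read(self, p):
        if self.state[p] == "invalid":  # read miss
            self.bus_log.append((p, "read"))
            self._snoop(p, "read")
            self.state[p] = "shared"
        # read hit (shared or exclusive): no bus action, no state change

    def write(self, p):
        if self.state[p] != "exclusive":  # write miss, or write hit on shared
            self.bus_log.append((p, "read and invalidate"))
            self._snoop(p, "read and invalidate")
            self.state[p] = "exclusive"
        # write hit on exclusive: no bus action


system = SnoopingSystem(num_caches=2)
trace = []
for op, p in [("read", 0), ("read", 1), ("write", 0), ("read", 1), ("write", 1)]:
    getattr(system, op)(p)
    trace.append(tuple(system.state))

for step in trace:
    print(step)
```

The last step leaves cache 0 invalid and cache 1 exclusive, and `system.bus_log` shows that cache 0 had to supply the data when cache 1 re-read the block it had lost.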
The MESI protocol distinguishes between shared (truly shared in more than one cache) and exclusive (in only one cache, in agreement with main memory). The MESI modified state corresponds to H&P's exclusive state. There are several differences between H&P's protocol and MESI:
- When a processor performs a read operation as a result of a local read miss, there are two possibilities:
  - If the block is read from memory, it is placed in the exclusive state
  - If the block is read from another cache, it is placed in the shared state
- When a write hit occurs on a block in the exclusive state, it goes to the modified state, but no bus operation is required
- When a write hit occurs on a block in the shared state, an invalidate operation is performed on the bus. No data is read or written.
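The three differences listed above can be seen in a single-block MESI sketch. As before, this is a toy under stated assumptions: `MesiSystem` is a made-up name, only one block is tracked, and any other cache holding a valid copy is treated as the source of the data on a read miss (real implementations differ on who supplies it).

```python
class MesiSystem:
    """MESI states for ONE block across several caches (illustrative only)."""

    def __init__(self, num_caches):
        self.state = ["I"] * num_caches
        self.bus_log = []  # (cache index, bus operation)

    def read(self, p):
        if self.state[p] != "I":
            return  # read hit: no bus action
        self.bus_log.append((p, "read"))
        holders = [i for i, s in enumerate(self.state) if i != p and s != "I"]
        for i in holders:
            self.state[i] = "S"  # M or E copies are downgraded to shared
        # From memory (no other copy) -> E; from another cache -> S.
        self.state[p] = "S" if holders else "E"

    def write(self, p):
        s = self.state[p]
        if s == "M":
            return  # write hit on modified: nothing to do
        if s == "E":
            self.state[p] = "M"  # silent upgrade, no bus operation
            return
        # S needs only an invalidate; I needs the data as well.
        op = "invalidate" if s == "S" else "read and invalidate"
        self.bus_log.append((p, op))
        for i in range(len(self.state)):
            if i != p:
                self.state[i] = "I"
        self.state[p] = "M"


mesi = MesiSystem(num_caches=2)
mesi_trace = []
for op, p in [("read", 0), ("write", 0), ("read", 1), ("write", 1)]:
    getattr(mesi, op)(p)
    mesi_trace.append("".join(mesi.state))

print(mesi_trace)    # ['EI', 'MI', 'SS', 'IM']
print(mesi.bus_log)  # [(0, 'read'), (1, 'read'), (1, 'invalidate')]
```

Note that the write by cache 0 in the E state produces no entry in the bus log, which is the main saving of MESI over the three-state protocol.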
Q: What happens if two processors try to write to the same shared block at the same time?
A: Both try to acquire the bus for a "read and invalidate" operation. One will get the bus first; that is, the bus serializes the requests.
Q: How does this affect performance?
- Cache blocks may need to be invalidated by the coherency protocol. So in addition to compulsory, capacity, and conflict misses, add one more "C": coherence misses.
- Average memory access time goes up, since writes to shared blocks take more time (other copies have to be invalidated)
Directory-based cache coherence
Use a directory to keep track of the state of each memory block.
- One entry per memory block, not just one per cache block.
- The directory is distributed, as is memory. Each node contains the directory entries for the memory on that node.
H&P present a three state directory-based protocol. The three states are:
- uncached
- shared (clean) - one or more cache copies exist, all agree with main memory
- exclusive (dirty) - one cached copy exists, it may not agree with main memory
Each directory entry records:
- the state of the block
- for exclusive blocks, the owner of the block (i.e., the processor which holds the one cached copy)
- for exclusive and shared blocks, the set of processors caching the block (use a bit vector to implement this set)
Terminology:
- local node: the node making a request
- home node: the node holding the main memory copy and directory entry for a block
- owner: a node caching an exclusive copy of a block
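A directory entry with a bit-vector share set might look like the following sketch. The class and method names (`DirectoryEntry`, `add_sharer`, and so on) are hypothetical, and a Python integer stands in for the fixed-width bit vector that hardware would use.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    state: str = "uncached"  # "uncached", "shared", or "exclusive"
    sharers: int = 0         # bit i is set when node i caches the block

    def add_sharer(self, node):
        self.sharers |= (1 << node)

    def remove_sharer(self, node):
        self.sharers &= ~(1 << node)

    def is_sharer(self, node):
        return bool((self.sharers >> node) & 1)

    def sharer_list(self):
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]

    def owner(self):
        # For an exclusive block the share set has exactly one member.
        if self.state != "exclusive":
            return None
        return self.sharers.bit_length() - 1


entry = DirectoryEntry(state="shared")
entry.add_sharer(0)
entry.add_sharer(3)
print(bin(entry.sharers))   # 0b1001
print(entry.sharer_list())  # [0, 3]
```

Storing the owner as the single set bit avoids a separate owner field, at the cost of a scan to find it; a real design could just as well keep both.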
Read miss on local node
Send read request to home node
- if uncached: home node sends data to local node, marks block shared, sets share set = { local node }
- if shared: home node sends data to local node, adds local node to share set
- if exclusive: home node sends a data fetch request to owner
  - owner responds with data, sets state to shared in its own cache
  - home node receives data, marks block shared, adds local node to share set, writes data to main memory
  - home node sends data to local node
Write miss on local node
Send read and invalidate request to home node
- if uncached: home node sends data to local node, marks block exclusive, sets owner = local node, share set = { local node }
- if shared: home node sends data to local node, sends invalidate message to all members of share set, marks block exclusive, sets owner = local node, share set = { local node }
- if exclusive: home node sends data fetch/invalidate request to owner
  - owner responds with data, sets state to invalid in its own cache
  - home node receives data, sends data to local node, sets owner = local node, share set = { local node }
The protocol must also respond to writeback operations performed by an owner node that replaces an exclusively held cache block with a different block. In that case,
- The owner invalidates the block and sends the data to the home node
- The home node writes the data to main memory and marks the block uncached
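The read miss, write miss, and writeback handling described above can be sketched as a home-node directory for a single block. Everything here is illustrative: the class name `Directory`, the message tuples, and the use of one integer as the block's data are assumptions, and a write by the current owner is assumed to be a hit that never reaches the directory.

```python
class Directory:
    """Home-node directory and memory for ONE block (three-state sketch)."""

    def __init__(self, value=0):
        self.state = "uncached"
        self.sharers = set()
        self.owner = None
        self.memory = value
        self.cache = {}     # node -> (cache state, value) for valid copies
        self.messages = []  # (kind, sender, receiver)

    def read_miss(self, local):
        self.messages.append(("read request", local, "home"))
        if self.state == "exclusive":
            # Fetch from the owner, which keeps a shared copy.
            self.messages.append(("fetch", "home", self.owner))
            _, value = self.cache[self.owner]
            self.cache[self.owner] = ("shared", value)
            self.memory = value
            self.owner = None
        self.state = "shared"
        self.sharers.add(local)
        self.cache[local] = ("shared", self.memory)
        self.messages.append(("data", "home", local))
        return self.memory

    def write_miss(self, local, new_value):
        self.messages.append(("read and invalidate request", local, "home"))
        if self.state == "shared":
            for node in sorted(self.sharers - {local}):
                self.messages.append(("invalidate", "home", node))
                del self.cache[node]
        elif self.state == "exclusive":
            # Old owner hands over the data and drops its copy.
            self.messages.append(("fetch/invalidate", "home", self.owner))
            self.cache.pop(self.owner)
        self.state = "exclusive"
        self.owner = local
        self.sharers = {local}
        self.messages.append(("data", "home", local))
        self.cache[local] = ("exclusive", new_value)  # local then writes

    def write_back(self, owner):
        _, value = self.cache.pop(owner)  # owner invalidates its copy
        self.messages.append(("write back", owner, "home"))
        self.memory = value
        self.state = "uncached"
        self.owner = None
        self.sharers = set()


d = Directory(value=0)
d.read_miss(1)
d.read_miss(2)
d.write_miss(3, 42)       # invalidates nodes 1 and 2
value_seen = d.read_miss(1)  # fetched from owner 3
print(value_seen, d.state, sorted(d.sharers))  # 42 shared [1, 3]
```

Main memory is stale between the write miss and the later read miss; it is refreshed only when the owner is asked for the data or performs a writeback.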
Example: Sun Fire E25K Server
- up to 18 CPU plug-in boards, each with up to 32GB main memory, for a total of 576GB
- up to 4 UltraSparc IV+ CPUs per board
- Cache coherence:
  - Snooping used within a single board
  - Directory used between boards
Two other issues of importance in multiprocessor design:
Synchronization
- Memory is shared between processors
- Need to regulate access to memory
- Basic idea: get a lock on a variable before modifying it
- Support for locks must exist in hardware
- Test and set instruction
  - applied to a 1-byte lock variable
  - sets it to 1 and returns the prior value, so we know if it's safe to proceed
  - must perform the read and write as one atomic operation. How?
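A spin lock built on test and set can be sketched as follows. Python has no test-and-set instruction, so the `LockVariable` class below simulates one: its internal `threading.Lock` only stands in for the hardware guarantee that the read and the write happen as one indivisible step. All class and function names are hypothetical.

```python
import threading
import time


class LockVariable:
    """Lock variable with a simulated atomic test-and-set."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()  # stand-in for hardware atomicity

    def test_and_set(self):
        with self._atomic:
            old = self._value
            self._value = 1
            return old  # 0 means the caller now holds the lock

    def clear(self):
        with self._atomic:
            self._value = 0


class SpinLock:
    def __init__(self):
        self._var = LockVariable()

    def acquire(self):
        # Keep trying until the prior value was 0 (lock was free).
        while self._var.test_and_set() == 1:
            time.sleep(0)  # yield so the holder can make progress

    def release(self):
        self._var.clear()


counter = 0
lock = SpinLock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1  # critical section
        lock.release()


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000
```

If `test_and_set` were split into a separate read and write, two threads could both read 0 and both enter the critical section, which is exactly why the instruction must be atomic.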
Consistency models
Potential problems even in cache coherent memory. Consider the following code:
| processor 0 | processor 1 |
|---|---|
| a = 0; | b = 0; |
| a = 1; | b = 1; |
| L1: if (b == 0) critical section; | L2: if (a == 0) critical section; |
Q: Is it possible for both processors to enter their critical sections at the same time?
A: Yes. (How?)
- Coherence doesn't guarantee consistent behavior in accessing more than one variable
- Need to enforce some sort of consistency
- First, need to define what it means for a memory to be consistent. Various models have been proposed.
  - Strict consistency
    - When a change is made to a variable, it is immediately visible to all reads by all processors
    - Impossible to implement
  - Sequential consistency
    - Result of any execution is the same as it would be if all memory operations were executed in some sequential order, and the operations of any single processor were executed in program order
    - Expensive to implement
  - Weaker models
    - Assume programs are synchronized; i.e. use locks to access shared data
    - Sequential consistency is only needed for lock variables
    - Other variables are either not shared, or shared only within critical sections
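The two-processor example from the start of this section can be checked by brute force. The sketch below enumerates every interleaving that sequential consistency allows and then a relaxed variant. The relaxed model is an assumption chosen for illustration: it lets each processor's read complete before its own earlier write becomes visible (as if the write or its invalidation were delayed). The initial assignments `a = 0` and `b = 0` are assumed to have completed already.

```python
def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each list's own order."""
    if not p0:
        yield list(p1)
        return
    if not p1:
        yield list(p0)
        return
    for rest in interleavings(p0[1:], p1):
        yield [p0[0]] + rest
    for rest in interleavings(p0, p1[1:]):
        yield [p1[0]] + rest


def run(schedule):
    """Return (P0 enters critical section, P1 enters critical section)."""
    memory = {"a": 0, "b": 0}
    seen = {}
    for proc, op, var in schedule:
        if op == "write":
            memory[var] = 1
        else:
            seen[proc] = memory[var]
    return (seen[0] == 0, seen[1] == 0)


# Program order: write own flag, then read the other processor's flag.
P0 = [(0, "write", "a"), (0, "read", "b")]
P1 = [(1, "write", "b"), (1, "read", "a")]

sc_outcomes = {run(s) for s in interleavings(P0, P1)}

# Relaxed (assumed model): the read may take effect before the write.
relaxed_outcomes = sc_outcomes | {
    run(s) for s in interleavings(P0[::-1], P1[::-1])
}

print((True, True) in sc_outcomes)       # False
print((True, True) in relaxed_outcomes)  # True
```

Under sequential consistency at most one processor enters its critical section; once a read can overtake a write, both can, even though each individual location is still coherent.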