

by jdonalds last modified 2005-12-12 10:46

Cache Coherence


To achieve reasonable performance on a shared memory multiprocessor (whether it is SMP or DSM), it is important that each processor cache a portion of the shared memory space.  Caches rely on the locality principle and improve performance through:
  • replication.  The same data item may be stored in more than one cache.
  • migration.  A data item may move to a place where it is being used.
However, the use of caches in a multiprocessor leads to the related problems of cache coherence and cache consistency.

Cache coherence:  Each processor must see a valid version of the address space.  In particular,
  1. If processor P writes to memory location X, then reads from X, and there are no intervening writes to X by other processors, then P will read the same value that it wrote.
  2. If P writes to X, the value it writes will eventually be visible to all other processors.
  3. Writes to the same location are serialized.  That is, they appear to occur in the same order to all processors.
If we are not careful, something like this could happen:

| time | event | cache A | cache B | memory |
|---|---|---|---|---|
| 0 | | | | 0 |
| 1 | A writes 1 to X; B writes 2 to X | 1 | 2 | 1 |
| 2 | A tells other processors to update their caches with (X,1); B tells other processors to update their caches with (X,2) | 2 | 1 | 1 |
| 3 | A reads X, gets 2; B reads X, gets 1 | 2 | 1 | 1 |
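
The scenario in the table can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper `apply_updates` and the variable names are made up for this example, and each cache is reduced to a single value for X. It shows that applying the two updates in different orders leaves the caches disagreeing, while applying them in one agreed order (as serialized writes require) does not.

```python
def apply_updates(initial, updates):
    """Return the value a cache holds for X after applying updates in order."""
    value = initial
    for _writer, new_value in updates:
        value = new_value
    return value

update_from_a = ("A", 1)
update_from_b = ("B", 2)

# Unserialized: each cache applies its own write first, then the remote update.
cache_a = apply_updates(0, [update_from_a, update_from_b])
cache_b = apply_updates(0, [update_from_b, update_from_a])

# Serialized: every cache applies the same global order of writes.
global_order = [update_from_a, update_from_b]
serial_a = apply_updates(0, global_order)
serial_b = apply_updates(0, global_order)

print("unserialized:", cache_a, cache_b)
print("serialized:  ", serial_a, serial_b)
```

In the unserialized case A ends up holding 2 and B holds 1, matching the last row of the table; in the serialized case both hold the same value.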

A related issue is cache consistency:  When a value is written by one processor, when will it become visible to other processors?  There are several ways to answer this question.

How is cache coherence enforced?  The processors follow a protocol in accessing the cache and main memory in order to enforce cache coherence.  The protocol indicates actions which need to be taken in case of a read miss, write miss, etc.  Cache coherence protocols on current multiprocessors are implemented in hardware, using one of two general approaches:
  1. snooping.  Used on bus-based multiprocessors.  Each processor "snoops" on the bus so that it is aware of memory operations performed by other processors.  It can use that information to determine when something it is caching is read or written by another processor and take an appropriate action.
  2. directory-based.  Used on multiprocessors without a shared bus.  A directory is used to keep track of the status of each cached block.

These protocols use two ways to determine what to do on a write operation:
  • write invalidate
    • most common
    • when P writes a value to X, other copies of X are invalidated
  • write update
    • when P writes a value to X, the same value is written to all other copies
Which works better?
  • Multiple writes by one processor to the same word or block, with no intervening reads by other processors, require multiple update broadcasts in a write update protocol, but only one invalidate operation in a write invalidate protocol.
  • A write to X by P followed by a read from X by Q is faster if write update is used.
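
The trade-off can be made concrete with a rough count of bus operations for a single block. The sketch below is a deliberately simplified model, not a description of any real machine: the function `bus_operations`, the trace format, and the rule that every miss, invalidate, or update costs exactly one bus operation are all assumptions made for illustration.

```python
def bus_operations(trace, policy):
    """Count bus operations for accesses to one block.

    trace:  list of (processor, op) pairs, op is "r" or "w"
    policy: "invalidate" or "update"
    """
    holders = set()   # processors currently holding a valid copy
    ops = 0
    for proc, op in trace:
        others = holders - {proc}
        if op == "r":
            if proc not in holders:
                ops += 1            # read miss: fetch the block
                holders.add(proc)
        elif policy == "invalidate":
            if proc not in holders or others:
                ops += 1            # read and invalidate, or invalidate on a write hit
            holders = {proc}
        else:                       # write update
            if proc not in holders:
                ops += 1            # write miss: fetch the block
                holders.add(proc)
            if others:
                ops += 1            # broadcast the new value to the other copies
    return ops

# P writes ten times in a row while Q holds a copy it never reads again.
repeated_writes = [("Q", "r"), ("P", "r")] + [("P", "w")] * 10
# P writes and Q reads, five times over.
producer_consumer = [("P", "r"), ("Q", "r")] + [("P", "w"), ("Q", "r")] * 5

for name, trace in [("repeated writes", repeated_writes),
                    ("producer/consumer", producer_consumer)]:
    print(name, bus_operations(trace, "invalidate"), bus_operations(trace, "update"))
```

Under these assumptions, repeated writes favor write invalidate (3 operations versus 12), while the producer/consumer pattern favors write update (12 versus 7), because Q's reads hit in its own cache.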

A bus-based cache coherence protocol

A common protocol implemented on Intel processors is MESI, so named because each cache block can be in one of four states:
  1. M (modified).  The block is cached only on this processor.  It has been modified, so it may not agree with the version in main memory.
  2. E (exclusive).  The block is cached only on this processor.  It is unmodified.
  3. S (shared).  The block is also cached on another processor or processors.  It is unmodified.
  4. I (invalid).  The cache does not hold a valid copy of the block.
The processor uses the current state of a block to determine what action to take in case of a read hit, read miss, etc.  It also responds to bus events initiated by other processors; that is, it snoops the bus.  Every bus event is globally visible to all processors.

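Hennessy and Patterson present a simplified form of the protocol using only three states:  invalid, shared, and exclusive.  (There is no separate clean-exclusive state.  A block in the exclusive state is the only cached copy and may have been modified.)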

Assume:
  • Individual processors use write back caches.
  • A processor must have an exclusive copy of a block in order to modify it.  That is, the protocol uses a "write invalidate" strategy.
  • The bus supports three transaction types:
    • read
    • read and invalidate (write miss)
    • write back
The protocol can be modeled as a finite state machine:

[Figure: state transition diagram for the three-state protocol]

To summarize the actions taken:

| current state | event | bus action | new state |
|---|---|---|---|
| invalid | read miss | read | shared |
| invalid | write miss | read and invalidate | exclusive |
| exclusive | read hit | none | exclusive |
| exclusive | write hit | none | exclusive |
| exclusive | read miss | write back followed by read | shared |
| exclusive | write miss | write back followed by read and invalidate | exclusive |
| shared | read hit | none | shared |
| shared | read miss | read | shared |
| shared | write hit | read and invalidate | exclusive |
| shared | write miss | read and invalidate | exclusive |
| exclusive | read on bus | respond to read request by putting data on bus, abort memory response | shared |
| exclusive | read and invalidate on bus | write back, abort memory response | invalid |
| shared | read on bus | none | shared |
| shared | read and invalidate on bus | none | invalid |

Note:  In this version, "shared" really means "sharable".  The cache block agrees with the contents of main memory.  There may or may not be additional copies in other caches.  The "exclusive" state really means "modified".
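
The three-state protocol lends itself to a table-driven sketch. The code below is one possible encoding, not H&P's own: it tracks a single block address in each cache, so misses caused by replacing one block with another (and the write backs they trigger) are left out. The names `PROCESSOR`, `SNOOP`, and `access` are invented for this example.

```python
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

# (current state, processor access) -> (bus action or None, new state)
PROCESSOR = {
    (INVALID, "read"): ("read", SHARED),
    (INVALID, "write"): ("read and invalidate", EXCLUSIVE),
    (SHARED, "read"): (None, SHARED),
    (SHARED, "write"): ("read and invalidate", EXCLUSIVE),
    (EXCLUSIVE, "read"): (None, EXCLUSIVE),
    (EXCLUSIVE, "write"): (None, EXCLUSIVE),
}

# (current state, bus action seen while snooping) -> new state
SNOOP = {
    (EXCLUSIVE, "read"): SHARED,                  # supply data, abort memory response
    (EXCLUSIVE, "read and invalidate"): INVALID,  # write back, abort memory response
    (SHARED, "read"): SHARED,
    (SHARED, "read and invalidate"): INVALID,
    (INVALID, "read"): INVALID,
    (INVALID, "read and invalidate"): INVALID,
}

def access(states, cpu, event):
    """Apply one processor access; states maps cpu name -> state of the block."""
    bus_action, new_state = PROCESSOR[(states[cpu], event)]
    if bus_action is not None:
        # Every other cache snoops the bus transaction and reacts to it.
        for other in states:
            if other != cpu:
                states[other] = SNOOP[(states[other], bus_action)]
    states[cpu] = new_state
    return bus_action

states = {"A": INVALID, "B": INVALID}
log = [access(states, cpu, event) for cpu, event in
       [("A", "read"), ("B", "read"), ("A", "write"), ("B", "read")]]
print(log)
print(states)
```

In this trace A's write invalidates B's copy, and B's following read pulls the block back out of A's cache, leaving both copies in the shared state.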

The MESI protocol distinguishes between shared (truly shared in more than one cache) and exclusive (in only one cache, in agreement with main memory).  The MESI modified state corresponds to H&P's exclusive state.  There are several differences between H&P's protocol and MESI:
  • When a processor performs a read operation as a result of a local read miss, there are two possibilities:
    • If the block is read from memory, it is placed in the exclusive state.
    • If the block is read from another cache, it is placed in the shared state.
  • When a write hit occurs on a block in the exclusive state, it goes to the modified state, but no bus operation is required.
  • When a write hit occurs on a block in the shared state, an invalidate operation is performed on the bus.  No data is read or written.
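
The three MESI differences listed above can be written down as two small functions. This is a sketch of just those rules, not a full MESI implementation; the function names and the single-letter state encoding are choices made for this example.

```python
def mesi_read_miss(supplied_by_cache):
    """New state after a local read miss.

    supplied_by_cache is True when another cache provided the block,
    False when it came from main memory.
    """
    return "S" if supplied_by_cache else "E"

def mesi_write_hit(state):
    """Return (bus operation or None, new state) for a local write hit."""
    if state in ("M", "E"):
        return (None, "M")           # sole copy: no bus traffic needed
    if state == "S":
        return ("invalidate", "M")   # other copies must be invalidated; no data moves
    raise ValueError("a write hit is not possible in state " + state)

print(mesi_read_miss(False), mesi_read_miss(True))
print(mesi_write_hit("E"), mesi_write_hit("S"))
```

The benefit of the separate E state shows up in `mesi_write_hit`: a block that was read and then written by the same processor, with no other sharers, never generates an invalidate on the bus.
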
Q:  What if two processors write to a shared block at the same time?

A:  Both try to acquire the bus for a "read and invalidate" operation.  One will get the bus first; that is, the bus serializes the requests.

Q:  How does this affect performance?
  • Cache blocks may need to be invalidated by the coherency protocol.  So in addition to compulsory, capacity, and conflict misses, add one more "C":  coherence misses.
  • Average memory access time goes up, since writes to shared blocks take more time (other copies have to be invalidated)

Directory-based cache coherence

Use a directory to keep track of the state of each memory block.
  • There is one entry per memory block, not just one per cache block.
  • The directory is distributed, as is memory.  Each node contains the directory entries for the memory on that node.

H&P present a three state directory-based protocol.  The three states are:
  • uncached
  • shared (clean) - one or more cache copies exist, all agree with main memory
  • exclusive (dirty) - one cached copy exists, it may not agree with main memory
For each block, the directory keeps track of
  • the state of the block
  • for exclusive blocks, the owner of the block (i.e., the processor which holds the one cached copy)
  • for exclusive and shared blocks, the set of processors caching the block (use a bit vector to implement this set)
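
A directory entry with a bit-vector sharer set might look like the following. This is a minimal sketch: the class name `DirectoryEntry` and its methods are hypothetical, and a Python integer stands in for the fixed-width bit vector that hardware would use (bit i set means node i caches the block).

```python
class DirectoryEntry:
    """Directory state for one memory block."""

    def __init__(self):
        self.state = "uncached"   # "uncached", "shared", or "exclusive"
        self.sharers = 0          # bit vector of nodes caching the block

    def add_sharer(self, node):
        self.sharers |= 1 << node

    def remove_sharer(self, node):
        self.sharers &= ~(1 << node)

    def is_sharer(self, node):
        return bool((self.sharers >> node) & 1)

    def sharer_list(self):
        return [n for n in range(self.sharers.bit_length()) if self.is_sharer(n)]

entry = DirectoryEntry()
entry.state = "shared"
for node in (0, 3, 5):
    entry.add_sharer(node)
entry.remove_sharer(3)
print(bin(entry.sharers), entry.sharer_list())
```

For an exclusive block the vector has exactly one bit set, so the same field can also identify the owner.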
The protocol describes how read and write misses are handled.  The following terminology is used:
  • local node:  the node making a request
  • home node:  the node holding the main memory copy and directory entry for a block
  • owner:  a node caching an exclusive copy of a block
Read miss on local node
  • Send read request to home node
  • if uncached
    • home node sends data to local node, marks block shared, sets share set = { local node }
  • if shared
    • home node sends data to local node, adds local node to share set
  • if exclusive
    • home node sends a data fetch request to owner
    • owner responds with data, sets state to shared in its own cache
    • home node receives data, marks block shared, adds local node to share set, writes data to main memory
    • home node sends data to local node

Write miss on local node
  • Send read and invalidate request to home node
  • if uncached
    • home node sends data to local node, marks block exclusive, sets owner = local node, share set = { local node }
  • if shared
    • home node sends data to local node, sends invalidate message to all members of share set, marks block exclusive, sets owner = local node, share set = { local node }
  • if exclusive
    • home node sends data fetch/invalidate request to owner
    • owner responds with data, sets state to invalid in its own cache
    • home node receives data, sends data to local node, sets owner = local node, share set = { local node }

The protocol must also respond to write back operations, which occur when an owner node replaces an exclusively held cache block.  In that case:
  • The owner invalidates the block and sends the data to the home node
  • The home node writes the data to main memory and marks the block uncached
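
The read miss, write miss, and write back handling described above can be sketched as a small simulation. This is an illustration under simplifying assumptions, not H&P's protocol in full: the `Directory` class and its fields are invented here, each block holds a single value, messages are only logged rather than sent, and for an exclusive block the one-element share set doubles as the owner field.

```python
class Directory:
    def __init__(self, num_nodes):
        self.state = {}      # block -> "uncached" | "shared" | "exclusive"
        self.sharers = {}    # block -> set of node ids caching the block
        self.memory = {}     # block -> value held in main memory
        self.caches = [dict() for _ in range(num_nodes)]  # per-node {block: value}
        self.messages = []   # log of (message, destination node)

    def read_miss(self, local, block):
        if self.state.get(block, "uncached") == "exclusive":
            (owner,) = self.sharers[block]
            self.messages.append(("fetch", owner))
            # Owner keeps its copy (now shared); home updates main memory.
            self.memory[block] = self.caches[owner][block]
        elif self.state.get(block, "uncached") == "uncached":
            self.sharers[block] = set()
        self.sharers[block].add(local)
        self.state[block] = "shared"
        self.caches[local][block] = self.memory.get(block, 0)

    def write_miss(self, local, block, value):
        current = self.state.get(block, "uncached")
        if current == "shared":
            for node in sorted(self.sharers[block] - {local}):
                self.messages.append(("invalidate", node))
                self.caches[node].pop(block, None)
        elif current == "exclusive":
            (owner,) = self.sharers[block]
            self.messages.append(("fetch/invalidate", owner))
            del self.caches[owner][block]   # owner's data is forwarded to local
        self.state[block] = "exclusive"
        self.sharers[block] = {local}       # local node is now the owner
        self.caches[local][block] = value   # one value per block, so the write replaces it

    def write_back(self, owner, block):
        self.memory[block] = self.caches[owner].pop(block)
        self.state[block] = "uncached"
        self.sharers[block] = set()

d = Directory(num_nodes=3)
d.memory["X"] = 7
d.read_miss(0, "X")       # uncached -> shared
d.read_miss(1, "X")       # another sharer joins
d.write_miss(2, "X", 9)   # nodes 0 and 1 are invalidated, node 2 becomes owner
d.read_miss(0, "X")       # data fetched from owner, memory updated
print(d.state["X"], sorted(d.sharers["X"]), d.memory["X"], d.messages)
```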


Example:  Sun Fire E25K Server
  • up to 18 CPU plug-in boards, each with up to 32GB main memory, for a total of 576GB
  • up to 4 UltraSparc IV+ CPUs per board
  • Cache coherence:
    • Snooping used within single board
    • Directory used between boards

Two other issues of importance in multiprocessor design:

Synchronization

  • Memory is shared between processors
  • Need to regulate access to memory
  • Basic idea:  get a lock on a variable before modifying it
  • Support for locks must exist in hardware
  • Test and set instruction
    • applied to a 1-byte lock variable
    • set it to 1 and read prior value so we know if it's safe to proceed
    • must perform read and write as one atomic operation.  How?
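
A spin lock built on test and set can be sketched as follows. Python has no test and set instruction, so this example fakes one: the `threading.Lock` inside the hypothetical `LockVariable` class stands in for the hardware guarantee that the read and the write happen as one indivisible step. Everything else (the busy-wait loop, the release by storing 0) follows the idea in the list above.

```python
import threading
import time

class LockVariable:
    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()   # stand-in for hardware atomicity

    def test_and_set(self):
        """Set the variable to 1 and return its prior value, atomically."""
        with self._atomic:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._atomic:
            self._value = 0

def acquire(lock_var):
    # A prior value of 1 means someone else holds the lock: keep trying.
    while lock_var.test_and_set() == 1:
        time.sleep(0)   # let other threads run while spinning

def release(lock_var):
    lock_var.clear()

shared = {"counter": 0}
lock_var = LockVariable()

def worker(iterations):
    for _ in range(iterations):
        acquire(lock_var)
        shared["counter"] += 1   # critical section
        release(lock_var)

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])
```

Only the thread that reads back 0 from `test_and_set` proceeds, so the increments never overlap and the final count equals the total number of iterations.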

Consistency models

Potential problems even in cache coherent memory.  Consider the following code:

processor 0:
    a = 0;
    a = 1;
    L1:  if (b == 0)
             critical section;

processor 1:
    b = 0;
    b = 1;
    L2:  if (a == 0)
             critical section;

Q:  Is it possible for both processors to enter their critical sections at the same time?

A:  Yes.  (How?)
  • Coherence doesn't guarantee consistent behavior in accessing more than one variable
  • Need to enforce some sort of consistency
  • First, need to define what it means for a memory to be consistent.  Various models have been proposed.
  • Strict consistency
    • When a change is made to a variable, it is immediately visible to all reads by all processors
    • Impossible to implement
  • Sequential consistency
    • Result of any execution is the same as it would be if all memory operations were executed in some sequential order, and the operations of each individual processor appear in that order as specified by its program
    • Expensive to implement
  • Weaker models
    • Assume programs are synchronized; i.e. use locks to access shared data
    • Sequential consistency is only needed for lock variables
    • Other variables are either not shared, or shared only within critical sections
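
Returning to the two-processor example above, a short enumeration shows why sequential consistency rules out both processors entering their critical sections, and how a weaker memory can allow it. This is a sketch under stated assumptions: each processor is reduced to one write followed by one read, memory starts with a = b = 0, and `run_with_store_buffers` models just one possible relaxation, in which each write sits in a local buffer until after both reads have been served from memory.

```python
P0 = [("write", "a", 1), ("read", "b")]
P1 = [("write", "b", 1), ("read", "a")]

def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each processor's program order."""
    if not p0 or not p1:
        yield [(0, op) for op in p0] + [(1, op) for op in p1]
        return
    for rest in interleavings(p0[1:], p1):
        yield [(0, p0[0])] + rest
    for rest in interleavings(p0, p1[1:]):
        yield [(1, p1[0])] + rest

def run(schedule):
    """Execute one interleaving; return (b as read by P0, a as read by P1)."""
    memory = {"a": 0, "b": 0}
    seen = {}
    for proc, op in schedule:
        if op[0] == "write":
            memory[op[1]] = op[2]
        else:
            seen[proc] = memory[op[1]]
    return seen[0], seen[1]

def run_with_store_buffers():
    """Both writes are buffered locally; both reads reach memory first."""
    memory = {"a": 0, "b": 0}
    buffers = {0: [("a", 1)], 1: [("b", 1)]}
    seen = (memory["b"], memory["a"])
    for pending in buffers.values():   # buffers drain only after the reads
        for var, value in pending:
            memory[var] = value
    return seen

sc_outcomes = {run(s) for s in interleavings(P0, P1)}
print(sorted(sc_outcomes))
print(run_with_store_buffers())
```

Both processors enter their critical sections only if both reads return 0. No sequentially consistent interleaving produces that outcome, but the buffered execution does, even though each individual location stays coherent.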





 
