An Introduction to Multiprocessor Systems
Scalability: Broadcast, Directories and Snoop Filters
While the choice of write update versus write invalidate is relatively straight forward, other design choices are not nearly as simple. In order to maintain correctness, different processors must communicate with each other to ensure that cache remain coherent. The simplest and fastest method is to broadcast all cache coherency messages to all processors.
Unfortunately, this is also the least scalable. Every time a processor has a cache miss, it must broadcast a snoop request and receive all the responses before it can obtain the necessary data. Since each transaction has a single transmission and a single response, broadcasting is relatively low latency. However, the cache coherency traffic is proportional to the square of the number of processors. A modern 3GHz processor might encounter 20M misses/second; which doesn’t sound horrible. However, for a system with 8 processors and point to point interconnects, that could be ~240M messages/second per processor. At 32 processors, that could rise to 38,440M messages/sec per processor (in reality, these numbers would be slightly lower depending on system topology, but the scaling difficulties are clear).
Generally, broadcast is ideal for smaller systems, where the scalability isn’t needed, but the low latency is beneficial. Broadcast is especially well suited for shared bus designs, since any bus transaction is automatically broadcast. One example of a snoop broadcast system is any 1-4 socket Opteron system, or older Intel based servers based with a single front-side bus.
Directory based cache coherency protocols scale much better, at the cost of slightly higher latency. In a broadcast snooping system, the meta information (sharing and dirty status, etc.) is scattered; there is no single repository for the cache coherency information. As a result, broadcasts are needed to gather all the information. By consolidating all that information into a single logical entity, the amount of bandwidth required can be cut down dramatically. A directory tracks the location and sharing status of the entire physical address space (at the cache line level). Instead of broadcasting coherency messages, the requesting processor will send a request to the directory. The directory determines which other processors have that cache line, and then forward the request only to those processors; this cuts down on coherency messages because processors which do not have the requested cache line are not disturbed. Directory size is determined by the physical address space, cache line size and number of processors. Rather than implement a single directory, which would be a substantial bottleneck, the directory is typically distributed alongside the memory in each node. Consequently, almost all directories use a three hop protocol.
One disadvantage of a directory is that they tend to increase latency. As mentioned before, three hop protocols are higher latency than the simpler two hop protocols. Moreover, in a directory protocol, the home node must look up where to send messages in the directory, which increases the latency further. Another issue is that the directory has a substantially different traffic profile than regular memory. Main memory traffic is dominated by reads; but for every memory transaction, there must be both a read (to find out which CPUs must respond) and a write (to update sharing status) to the directory. The upside is that directory based servers can scale to much larger sizes because they waste less bandwidth: the GS1280 Alpha server, using the EV7, scales up to 128 processors, while SGI’s Origin and Altix servers can use 1024 processors. Figure 4 below shows the difference between a coherency transaction in a broadcast (top) and directory (bottom) based system.
Figure 4 – Broadcast versus Directory Cache Coherency Transaction
In this hypothetical situation, CPU 1 has the cache line requested by CPU 3. The directory can take advantage of this information to avoid disturbing CPU 2 entirely. In the broadcast system, all the processors must be probed and respond to the requester (CPU 3).
There is a middle road between broadcast coherency and directory based coherency, which has become more popular recently: the snoop filter (also referred to as a sparse or partial directory). While the implementation varies, the general idea is to maintain a partial directory. Instead of tracking the entire physical address space, just track all the caches in the system, or all data shared with remote nodes. This lowers the storage requirements substantially and preserves some of the benefits of a directory. Of course, since a hit in the snoop filter isn’t guaranteed, and in that case, the system falls back on broadcasting. This can lead to more unpredictable latency and bandwidth for a given transaction. Snoop filters are also slightly more flexible; they can easily be used with both two and three hop coherency protocols. Existing systems that use snoop filters include IBM’s X3, the Newisys HORUS, and Intel’s Blackford chipset. The Blackford chipset, as an example, contains duplicates of the tags from each processors L2 cache which are used for snoop filtering.