
Chapter 11. System Architectures

This chapter describes the architecture of recent SPARC-based systems and explores the performance trade-offs that are made throughout the range. The appropriate SPEC benchmarks are listed for each machine. I first describe the uniprocessor SPARC machines, then I describe in detail the multiprocessor machines. Subsequent chapters describe in more depth the components that make up these systems.

SPARC Architecture and Implementation

The SPARC architecture defines everything that is required to ensure application-level portability across varying SPARC implementations. It intentionally avoids defining some things, like how many cycles an instruction takes, to allow maximum freedom within the architecture to vary implementation details. This is the basis of SPARC's scalability from very low cost to very high performance systems. Implementation differences are handled by kernel code only, so that the instruction set and system call interface are the same on all SPARC systems. The SPARC Compliance Definition, controlled by the independent SPARC International organization, specifies this interface. Within this standard there is no specification of the performance of a compliant system, only its correctness. The performance depends on the chip set used (i.e., the implementation) and the clock rate at which the chip set runs. To avoid confusion, some terms need to be defined.

Instruction Set Architecture (ISA)

The ISA is defined by the SPARC Architecture Manual. SPARC International has published Version 7, Version 8, and Version 9, and the IEEE has produced a standard based on Version 8, IEEE 1754. Version 9 defines major extensions including 64-bit addressing in an upward-compatible manner for user-mode programs. Prentice Hall has published both the Version 8 and Version 9 SPARC Architecture manuals.

SPARC Implementation

A chip-level specification, the SPARC implementation defines how many cycles each instruction takes and other details. Some chip sets define only the integer unit (IU) and floating-point unit (FPU); others define the memory management unit (MMU) and cache design and may include the whole lot on a single chip.

System Architecture

System architecture defines a board-level specification of everything the kernel has to deal with on a particular machine. It includes internal and I/O bus types, address space uses, and built-in I/O functions. This level of information is documented in the SPARCengine™ User Guides that are produced for the bare-board versions of Sun's workstation products. The information needed to port a real-time operating system to the board and physically mount the board for an embedded application is provided.

Kernel Architecture

A number of similar systems may be parameterized so that a single GENERIC kernel image can be run on them. This grouping is known as a kernel architecture. Sun has one kernel architecture for VME-based SPARC machines (Sun-4™), one for SBus-based SPARC machines (Sun-4c), one for the VME and SBus combined 6U eurocard machines (Sun-4e), one for MBus-based machines and machines that use the SPARC Reference MMU (Sun-4m), and one for XDBus-based machines (Sun-4d). These are listed in Table 11-1 on page 256.

Register Windows and Different SPARC CPUs

SPARC defines an instruction set that uses 32 integer registers in a conventional way but has many sets of these registers arranged in overlapping register windows. A single instruction is used to switch in a new set of registers very quickly. The overlap means that 8 of the registers from the previous window are part of the new window, and these are used for fast parameter passing between subroutines. A further 8 global registers are always the same; of the 24 that make up a register window, 8 are passed in from the previous window, 8 are local, and 8 will be passed out to the next window. This usage is described further in The SPARC Architecture Manual, Version 8 and Version 9, both published as books by Prentice Hall.

Some SPARC implementations have seven overlapping sets of register windows, and some have eight. One window is always reserved for taking traps or interrupts since these will need a new set of registers; the others can be thought of as a stack cache for six or seven levels of procedure calls with up to six parameters per call passed in registers. The other two registers hold the return address and the old stack frame pointer. If there are more than six parameters to a call, then the extra ones are passed on the external stack as in a conventional architecture. You can see that the register windows architecture allows much faster subroutine calls and returns and faster interrupt handling than in conventional architectures that copy parameters out to a stack, make the call, then copy the parameters back into registers. Programs typically spend most of their time calling up and down a few levels of subroutines; however, when the register windows have all been used, a special trap takes place, and one window (16 registers) is copied to the stack in main memory. On average, register windows seem to cut down the number of loads and stores required by 10–30 percent and provide a speedup of 5–15 percent. Programmers must be careful to avoid writing code that makes a large number of recursive or deeply nested calls and keeps returning to the top level. If very little work is done at each level and very few parameters are being passed, then the program may generate a large number of save and restore traps. The SunPro™ SPARCompiler optimizer performs tail recursion elimination and leaf routine optimization to reduce the depth of the calls.
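
The kind of code that defeats the register windows is easy to recognize. The sketch below is illustrative only, and the function names are hypothetical: the recursive version needs a new window for every element, while the iterative version does the same work in a single window.

/*
 * Illustrative only: sum_recursive() makes one nested call per element,
 * so beyond the first few levels each additional call takes a
 * window-overflow trap (eight double-word stores to the stack), and
 * each return back up takes an underflow trap. sum_iterative() does
 * the same work in one register window and never traps.
 */
long sum_recursive(const int *a, int n)
{
    if (n == 0)
        return 0;
    return a[0] + sum_recursive(a + 1, n - 1);  /* new window per call */
}

long sum_iterative(const int *a, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)                     /* no calls, no window traps */
        total += a[i];
    return total;
}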

If an application performs a certain number of procedure calls and causes a certain number of window traps, the benefit of reducing the number of loads and stores must be balanced against the cost of the traps. The overflow trap cost is very dependent on the time taken to store eight double-words to memory. On systems with write-through caches and small write buffers, like the SPARCstation 1, a large number of write stalls occur and the cost is relatively high. The SPARCstation 2 has a larger write buffer of two double-words, which is still not enough. The SuperSPARC chip in write-through mode has an 8-double-word write buffer, so will not stall; other systems with write-back caches will not stall (unless a cache line needs to be updated).

The SPARC V9 architecture supports a new, multiple-level trap architecture. This architecture greatly reduces the administrative overhead of register window traps since the main trap handler no longer has to check for page faults. This feature increases the relative performance boost of register windows by reducing the trap time.

The Effect of Context Switches and Interrupts

When a program is running on a SPARC chip, the register windows act as a stack cache and provide a performance boost. Subroutine calls tend to occur every few microseconds on average in integer code but may be infrequent in vectorizable floating-point code. Whenever a context switch occurs, the register windows are flushed to memory and the stack cache starts again in the new context. Context switches tend to occur every few milliseconds on average, and a ratio of several hundred subroutine calls per context switch is a good one; there is time to take advantage of the register windows before they are flushed again. When the new context starts up, it loads in the register windows one at a time, so programs that do not make many subroutine calls do not load registers that they will not need. Note that a special trap is provided that can be called to flush the register windows; this trap is needed if you wish to switch to a different stack as part of a user-written co-routine or threads library. When SunOS is running, a context-switch rate of 1000 per second on each CPU is considered fast, so there are rarely any problems. There may be more concern about this ratio when real-time operating systems are running on SPARC machines, but there are alternative ways of configuring the register windows that are more suitable for real-time systems. These systems often run entirely in kernel mode and can perform special tricks to control the register windows.

The register window context-switch time is a small fraction of the total context-switch time. On machines with virtual write-back caches, a cache flush is also required on some context switches. Systems have varying amounts of support for fast cache flush in hardware. The original SunOS 4.0 release mapped the kernel u-area at the same address for all processes, and the u-area flush gave the Sun-4/260 with SunOS 4.0 (the first SPARC machine) a bad reputation for poor context-switch performance that some people mistakenly blamed on the register windows.

Identifying Different SPARC CPUs

You can use Table 11-1 to find which SPARC CPU each machine has. In some cases, the CPU MHz shown is the actual clock rate rather than the nominal one used in product literature. Some UltraSPARC systems use a programmable clock generator that goes up in steps of 8 MHz; hence, 248 MHz is used rather than 250 MHz.

Table 11-1. Which SPARC IU and FPU Does Your System Have?
System (Kernel Architecture)                            CPU MHz    CPU Type
Sun-4/110 and Sun-4/150 (sun4)                          14         Fujitsu MB86901
Sun-4/260 and Sun-4/280 (sun4)                          16         Fujitsu MB86901
SPARCserver 300 series (sun4)                           25         Cypress 601
SPARCserver 400 series (sun4)                           33         Cypress 601
SPARCstation 1 and SLC (sun4c)                          20         LSI/Fujitsu L64911
SPARCstation 1+ and IPC (sun4c)                         25         LSI/Fujitsu L64911
Tadpole SPARCbook 1 (sun4m)                             25         Cypress 601
SPARCstation ELC (sun4c)                                33         Fujitsu MB86903
SPARCstation IPX (sun4c)                                40         Fujitsu MB86903
Tadpole SPARCbook 2 (sun4m)                             40         Fujitsu MB86903
SPARCstation 2 (sun4c)                                  40         Cypress 601
SPARC PowerUP - SS2 and IPX upgrade (sun4c)             80         Weitek
SPARCserver 600 Model 120/140 (sun4m)                   40         Cypress 601
SPARCstation 10 & SPARCstation 20 (sun4m)               33–60      SuperSPARC - TI390Z50
SPARCstation 20 (sun4m)                                 75         SuperSPARC II - TI390Z50
Cray SuperServer CS6400 (sun4d)                         60–85      SuperSPARC - TI390Z50
SPARCclassic & SPARCstation LX (sun4m)                  50         microSPARC
SPARCcenter 1000 & 2000 (sun4d)                         50–60      SuperSPARC - TI390Z50
SPARCcenter 1000E & 2000E (sun4d)                       85         SuperSPARC II - TI390Z50
Cray S-MP & FPS 500EA (sun4)                            66         BIT B5000
SPARCstation Voyager (sun4m)                            65         microSPARC II
SPARCstation 4 and SPARCstation 5 (sun4m)               70–110     microSPARC II
Tadpole SPARCbook 3 (sun4m)                             85–110     microSPARC II
SPARCstation 5 (sun4m)                                  170        Fujitsu TurboSPARC
Tadpole SPARCbook 3XT & 3000 (sun4m)                    170        Fujitsu TurboSPARC
Ultra 1 (sun4u)                                         144–200    UltraSPARC I
Ultra 2 (sun4u)                                         168–200    UltraSPARC I
Ultra 2 (sun4u)                                         296        UltraSPARC II
Ultra 30 and Ultra Enterprise 450 (sun4u)               248–296    UltraSPARC II
Ultra Enterprise 3000, 4000, 5000, 6000 (sun4u)         168        UltraSPARC I
Ultra Enterprise 3002, 4002, 5002, 6002, 10000 (sun4u)  248        UltraSPARC II
Integrated Ultra II, DRAM control and PCI bus           266–300    UltraSPARC IIi
Next generation chip announced 1997                     600        UltraSPARC III
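
To find out at run time which kernel architecture and platform a given machine reports (the first column of Table 11-1), a program can call the sysinfo(2) interface; this is the same information printed by uname -m and uname -i. The sketch below reports only the names, not the clock rate, so Table 11-1 is still needed to identify the exact CPU type; the example platform strings in the comments are assumptions.

#include <stdio.h>
#include <sys/systeminfo.h>

int main(void)
{
    char arch[64], platform[256];

    /* Kernel architecture, e.g. "sun4c", "sun4m", "sun4d", or "sun4u" */
    if (sysinfo(SI_MACHINE, arch, sizeof(arch)) == -1)
        perror("sysinfo(SI_MACHINE)");

    /* Specific platform name, e.g. "SUNW,SPARCstation-5" */
    if (sysinfo(SI_PLATFORM, platform, sizeof(platform)) == -1)
        perror("sysinfo(SI_PLATFORM)");

    printf("kernel architecture: %s\nplatform: %s\n", arch, platform);
    return 0;
}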

SuperSPARC

The SuperSPARC can issue three instructions in one clock cycle; just about the only instructions that cannot be issued continuously are integer multiply and divide, floating-point divide, and square root. A set of rules controls how many instructions are grouped for issue in each cycle. The main ones are that instructions are executed strictly in order and subject to the following conditions:

  • Three instructions in a group

  • One load or store anywhere in a group

  • A control transfer (branch or call) ends a group at that point

  • One floating point operation anywhere in a group

  • Two integer-word results or one double-word result (including loads) per group

  • One shift can cascade into another operation, but not vice versa

Dependent compare-and-branch is allowed, and simple ALU cascades are allowed (a + b + c). Floating-point load and a dependent operation are allowed, but dependent integer operations have to wait for the next cycle after a load.

Cypress/Fujitsu Ross HyperSPARC

The HyperSPARC design can issue two instructions in one clock cycle. The combinations allowed are more restrictive than with SuperSPARC, but the simpler design allows a higher clock rate to compensate. The overall performance of a 66 MHz HyperSPARC is comparable to that of a 50 MHz SuperSPARC, but electronic design simulations run much faster on HyperSPARC, and database servers run much faster on SuperSPARC.

UltraSPARC

UltraSPARC implements the 64-bit SPARC V9 architecture, but a hybrid specification called SPARC V8plus is used as an extended 32-bit subset. A version of Solaris that supports the 64-bit V9 address space should be available during 1998. UltraSPARC issues up to four instructions per clock cycle: one load/store, one floating point, and two integer. It also contains an extended set of operations called the Visual Instruction Set, or VIS. VIS mostly consists of pixel-based operations for graphics and image processing, but it also includes cache-line-oriented block move operations. These are used to accelerate large block-copy and block-zero operations by the kernel and by user programs that dynamically link to the bcopy(3C) and memcpy(3C) routines. See also “When Does ‘64 Bits’ Mean More Performance?” on page 134 and “UltraSPARC Compiler Tuning” on page 146.

SPARC CPU Cache Architectures

Since the hardware details vary from one machine implementation to another and the details are sometimes hard to obtain, the cache architectures of some common machines are described below, divided into four main groups: virtually and physically addressed caches with write-back algorithms, virtual write-through caches, and on-chip caches. For more details of the hardware implementation of older systems, read Multiprocessor System Architectures, by Ben Catanzaro (SunSoft Press). For up-to-date information on SPARC, detailed datasheets and white papers are available from Sun Microelectronics via http://www.sun.com/sparc.

Virtual Write-through Caches

Most older desktop SPARCstations from Sun and the deskside SPARCsystem 300 series use virtual write-through caches. The virtual write-through cache works by using virtual addresses to decide the location in the cache that has the required data in it. This technique avoids the need to perform an MMU address translation except when there is a cache miss.

Data is read into the cache a block at a time, but writes go through the cache into main memory as individual words. This approach avoids the problem of data in the cache being different from data in main memory but may be slower since single-word writes are less efficient than block writes would be. An optimization is for a buffer to be provided so that the word can be put into the buffer; then, the CPU can continue immediately while the buffer is written to memory. The depth (one or two words) and width (32 or 64 bits) of the write buffer vary. If a number of words are written back-to-back, then the write buffer may fill up and the processor will stall until the slow memory cycle has completed. A double-word write on a SPARCstation 1 (and similar machines) will always cause a write-buffer overflow stall that takes 4 cycles.
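
As a concrete illustration of the write-buffer limit (the loop below is hypothetical, not taken from any benchmark): each 8-byte store typically compiles to a single double-word store instruction, so on a machine with a one-word write buffer, such as the SPARCstation 1, every iteration takes the 4-cycle overflow stall described above, while the deeper buffers on later systems absorb the writes.

/* Back-to-back double-word stores: on a write-through cache with a
 * one-word (4-byte) write buffer, each 8-byte store overflows the
 * buffer and stalls the CPU for about 4 cycles while it drains. */
void zero_doubles(double *p, int n)
{
    int i;

    for (i = 0; i < n; i++)
        p[i] = 0.0;        /* one double-word store per iteration */
}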

On the machines listed in Table 11-2, the processor waits until the entire cache block has been loaded before continuing.

The SBus used for memory accesses on the SS2, IPX, and ELC machines runs at half the CPU clock rate. This speed difference may give rise to an extra cycle on the miss cost to synchronize the two buses, and the extra cycle occurs half of the time on average.

Table 11-2. Virtual Write-through Cache Details
Machine            Clock     Size     Line   Read Miss      WB Size     WB Full Cost
SS1, SLC           20 MHz    64 KB    16 B   12 cycles      1 word      2 cycles (4 dbl)
SS1+, IPC          25 MHz    64 KB    16 B   13 cycles      1 word      2 cycles (4 dbl)
SS330, 370, 390    25 MHz    128 KB   16 B   18 cycles      1 double    2 cycles
ELC                33 MHz    64 KB    32 B   24–25 cycles   2 doubles   4–5 cycles
SS2, IPX           40 MHz    64 KB    32 B   24–25 cycles   2 doubles   4–5 cycles

Virtual Write-back Caches

The larger desk-side SPARCservers use virtual write-back caches. The cache uses virtual addresses as described above. The difference is that data written to the cache is not written through to main memory. This reduces memory traffic and allows efficient back-to-back writes to occur. The penalty is that a cache line must be written back to main memory before it can be reused, so there may be an increase in the miss cost. The line is written efficiently as a block transfer, then the new line is loaded as a block transfer. Most systems have a buffer that stores the outgoing cache line while the incoming cache line is loaded, then the outgoing line is passed to memory while the CPU continues.

The SPARCsystem 400 backplane is 64 bits wide and runs at 33 MHz, synchronized with the CPU. The SPARCsystem 600 uses a 64-bit MBus and takes data from the MBus at full speed into a buffer; the cache itself is 32 bits wide and takes extra cycles to pass data from the buffer in the cache controller to the cache itself. The cache coherency mechanisms required for a multiprocessor machine also introduce extra cycles. There is a difference between the number of MBus cycles taken for a cache miss (the bus occupancy) and the number of cycles that the CPU stalls for. A lower bus occupancy means that more CPUs can be used on the bus. Table 11-3 summarizes the details of the cache.

Table 11-3. Virtual Write-back Cache Details
Machine                     Size      Line       Miss Cost    CPU Clock Rate
Sun-4/200 series            128 KB    16 bytes   7 cycles     16 MHz
SPARCserver 400             128 KB    32 bytes   12 cycles    33 MHz
SPARCserver 600 model 120   64 KB     32 bytes   30 cycles    40 MHz

Physical Write-back Caches: SuperSPARC and UltraSPARC

All SuperSPARC- and UltraSPARC-based systems use physical write-back caches for their second-level cache. The first-level cache is described in “On-chip Caches” on page 261.

The MMU translations occur on every CPU access before the address reaches the cache logic. The cache uses the physical address of the data in main memory to determine where in the cache it should be located. In other respects, this type is the same as the virtual write-back cache described above.

The SuperSPARC SuperCache™ controller implements sub-blocking in its 1-Mbyte cache. The cache line is actually 128 bytes but is loaded as four separate, contiguous, 32-byte lines. This approach cuts the number of cache tags required at the expense of needing an extra three valid bits in each tag. In XDBus mode for the SPARCserver 1000, the same chip set switches to two 64-byte blocks. The SPARCcenter 2000 takes advantage of this switch to use four 64-byte blocks per 256-byte line, for a total of 2 Mbytes of cache on a special module. This module cannot be used in any other systems since it requires the twin XDBus BusWatchers that hold duplicate tags for 1 Mbyte of cache each. In XDBus mode, the cache request and response are handled as separate transactions on the bus, and other bus traffic can interleave for better throughput but delayed response. The larger cache line takes longer to transfer, but the memory system is in multiple, independent banks, accessed over two interleaved buses on the SPARCcenter 2000. When many CPUs access memory at once, the single MBus memory system bottlenecks more quickly than with the XDBus.

The UltraSPARC external cache controller is an integrated part of the design, and this integration greatly reduces latency. The miss cost measured in CPU clock cycles is similar to that of SuperSPARC, but since the clock rate is three to four times higher, the actual latency measured in nanoseconds is far lower. The SuperSPARC-based design is asynchronous, with the CPU and cache running from a different clock to the system backplane. UltraSPARC is synchronous, with a central clock source; this arrangement reduces latency because there is never a need to wait for clocks to synchronize to transfer data around the system. The extra performance is traded off against convenience of upgrades. It is not possible to mix different speed UltraSPARC processors in the same system. It was possible to do this with SuperSPARC, but it was never fully tested and supported. It was also possible to have smaller clock rate increments with SuperSPARC without changing the system bus clock rate. With UltraSPARC, the CPU clock rate must be an exact multiple of the system bus clock rate, often a multiple of 83 or 100 MHz.

The two-level cache architecture is so complex that a single cache miss cost cannot be quoted. It can range from about 25 cycles in the ideal case, through about 40 cycles in a typical case, to over 100 cycles in the worst case. The worst-case situation involves a read miss that victimizes a full but dirty cache line. Up to four blocks must be written back to main memory before the read data can be loaded.

On-chip Caches

Highly integrated SPARC chip sets like SuperSPARC, microSPARC, and UltraSPARC use on-chip caches. The Fujitsu/Ross HyperSPARC uses a hybrid on-chip instruction cache with off-chip unified cache. The microSPARC II has four times the cache size of the original microSPARC and has other performance improvements. One other development is the Weitek PowerUP, which is a plug-in replacement for the SS2 and IPX CPU chip that adds on-chip caches and doubles the clock rate. UltraSPARC requires the second-level cache to be present as well.

Since the entire first-level cache is on chip, complete with its control logic, a different set of trade-offs applies to cache design. The size of the cache is limited, but the complexity of the cache control logic can be enhanced more easily. On-chip caches may be associative, in that a line can exist in the cache in several possible locations. If there are four possible locations for a line, then the cache is known as a four-way, set-associative cache. It is hard to build off-chip caches that are associative, so they tend to be direct mapped, where each memory location maps directly to a single cache line.
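
The difference between the two organizations comes down to how a line is chosen for a given address. The sketch below uses example sizes (a 64-Kbyte cache with 32-byte lines), not the parameters of any particular SPARC implementation.

#define LINE_SIZE  32                    /* bytes per cache line (example) */
#define CACHE_SIZE (64 * 1024)           /* total cache size (example)     */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)
#define WAYS       4                     /* for the set-associative case   */
#define NUM_SETS   (NUM_LINES / WAYS)

/* Direct mapped: each address can live in exactly one cache line. */
unsigned direct_mapped_line(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_LINES;
}

/* Four-way set associative: the address selects a set, and the line may
 * be placed in any of the four ways of that set. */
unsigned set_associative_set(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}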

On-chip caches also tend to be split into separate instruction and data caches since this division allows both caches to transfer during a single clock cycle, thus speeding up load and store instructions. This optimization is not often done with off-chip caches because the chip would need an extra set of pins and more chips would be needed on the circuit board.

Intelligent cache controllers can reduce the miss cost by passing the memory location that missed as the first word in the cache block, rather than starting with the first word of the cache line. The processor can then be allowed to continue as soon as this word arrives, before the rest of the cache line has been loaded. If the miss occurred in a data cache and the processor can continue to fetch instructions from a separate instruction cache, then this continuation will reduce the miss cost. In contrast, with a combined instruction and data cache, the cache load operation keeps the cache busy until it has finished, so the processor cannot fetch another instruction anyway. SuperSPARC, microSPARC, and UltraSPARC do implement this optimization.

The microSPARC design uses page-mode DRAM to reduce its miss cost. The first miss to a 1-Kbyte region takes 9 cycles for a data miss; consecutive accesses to the same region avoid some DRAM setup time and complete in 4 cycles.

The SuperSPARC processor implements sub-blocking in its instruction cache. The cache line is actually 64 bytes but is loaded as two separate, contiguous, 32-byte lines. If the on-chip cache is connected directly to main memory, it has a 10-cycle effective miss cost; if it is used with a SuperCache, it can transfer data from the SuperCache to the on-chip cache with a 5-cycle cost. These details are summarized in Table 11-4.

Table 11-4. On-chip Cache Details
Processor       I-Size    I-Line    I-Assoc    D-Size       D-Line    D-Assoc    D-Miss
MB86930         2 KB      16        2          2 KB         16        2          N/A
microSPARC      4 KB      32        1          2 KB         16        1          4–9 cycles
microSPARC II   16 KB     32        1          8 KB         16        1          N/A
PowerUP         16 KB     32        1          8 KB         32        1          N/A
HyperSPARC      8 KB      32        1          256 KB [1]   32        1          10 cycles
SuperSPARC      20 KB     64 (32)   5          16 KB        32        4          5–10 cycles
UltraSPARC      16 KB     32        2          16 KB        16+16     1          7–10 cycles

[1] This is a combined external instruction and data cache.

The SuperSPARC with SuperCache Two-Level Cache Architecture

As previously described, the SuperSPARC processor has two sophisticated and relatively large on-chip caches and an optional 1-Mbyte external cache, known as SuperCache or MXCC (for MBus XBus Cache Controller)[1]. It can be used without the external cache, and, for transfers, the on-chip caches work directly over the MBus in write-back mode. For multiprocessor snooping to work correctly and efficiently, the on-chip caches work in write-through mode when the external cache is used. This technique guarantees that the on-chip caches contain a subset of the external cache so that snooping is required only on the external cache. An 8-double-word (64-byte) write buffer flushes through to the external cache.

[1] See “The SuperSPARC Microprocessor Technical White Paper.”

Cache Block Prefetch

A mode bit can be toggled (see Figure 11-1) in the SuperCache to cause cache blocks to be fetched during idle time. If a cache line has invalid sub-blocks but a valid address tag, then the missing sub-blocks will be prefetched. This mode is turned on by default in Solaris 2.2. It is off by default (but switchable) in Solaris 2.3 and later because database and SPEC benchmarks run better without it in most cases.

Figure 11-1. Setting SuperCache Prefetch Mode
* Control the prefetch performance bit 
* on SuperSPARC/MXCC machines. Ignored on non-SuperSPARC machines.
* Valid values are boolean: non-zero means yes, zero means no.
* "use_mxcc_prefetch" controls whether sub-blocks are prefetched
* by MXCC. E-cache miss rates may improve, albeit at higher
* memory latency. Improvement depends on workload.
SET use_mxcc_prefetch = 1

Efficient Register Window-Overflow Handling

One common case in SPARC systems is a register window overflow trap, which involves eight consecutive double-word writes to save the registers. All eight writes can fit in the SuperSPARC and UltraSPARC write buffers, and they can be written to the second-level cache in a burst transfer.

UltraSPARC Two-Level Cache Architecture

UltraSPARC configures each of its on-chip caches differently. The instruction cache is associative, using 32-byte lines, whereas the data cache is direct mapped, using two 16-byte sub-blocks per line. The on-chip caches are virtual so that they can be accessed without first going through the MMU translation, to save time on loads and stores. In the event of a level-one cache miss, the MMU translation is performed before the second-level, physically indexed cache is accessed. The UltraSPARC I design operates with the fastest external SRAM cache at up to 200 MHz, with the external cache also operating at 200 MHz. UltraSPARC II operates too quickly for the external cache, so it pipelines the access differently and cycles the external cache at half the CPU clock rate. The UltraSPARC I at 167 MHz takes 7 clock cycles, or 42 ns, to load from the external cache. The UltraSPARC II at 300 MHz takes 10 clock cycles, or 33 ns, to perform the same load. These are worst-case latency figures, and pipelining allows an independent cache transfer to start every three or four cycles.

I/O Caches

If an I/O device is performing a DVMA transfer (for example, a disk controller is writing data into memory), the CPU can continue other operations while the data is transferred. Care must be taken to ensure that the data written to by the I/O device is not also in the cache; otherwise, inconsistencies can occur. On older Sun systems and the 4/260 and SPARCsystem 300, every word of I/O is passed through the cache. A lot of I/O activity slows down the CPU because the CPU cannot access the cache for a cycle. The SPARCsystem 400 has an I/O cache that holds 128 lines of 32 bytes and checks its validity with the CPU cache once for each line. The interruption to the CPU is reduced from once every 4 bytes to once every 32 bytes. The other benefit is that single-cycle VMEbus transfers are converted by the I/O cache into cache-line-sized block transfers to main memory, which is much more efficient[2]. The SPARCserver 600 has a similar I/O cache on its VMEbus-to-SBus interface but has 1,024 lines rather than 128. The SBus-to-MBus interface can use block transfers for all I/O, so it does not need an I/O cache. However, it does have its own I/O MMU, and I/O is performed in a cache-coherent manner on the MBus in the SPARCserver 10 and SPARCserver 600 machines, on the XDBus in the SPARCserver 1000 and SPARCcenter 2000 machines, and over the UPA in the UltraSPARC-based machines.

[2] See “A Cached System Architecture Dedicated for the System IO Activity on a CPU Board,” by Hseih, Wei, and Loo.

Block Copy Support

The kernel spends a large proportion of its time copying or zeroing blocks of data. These blocks may be internal buffers or data structures, but a common operation involves zeroing or copying a page of memory, which is 4 Kbytes or 8 Kbytes. The data is not often used again immediately, so it does not need to be cached. In fact, the data being copied or zeroed will normally remove useful data from the cache. The standard C library routines for this operation are called bcopy and memcpy; they handle arbitrary alignments and lengths of copies.

Software Page Copies

The most efficient way to copy a page on a system with a write-back cache is to read a cache line, then write it as a block, using two bus transactions. The sequence of operations for a software copy loop is actually the following:

1. Load the first word of the source, causing a cache miss.

2. Fetch the entire cache line from the source.

3. Write the first word to the destination, causing a cache miss.

4. Fetch the entire cache line from the destination (the system cannot tell that you are going to overwrite all the old values in the line).

5. Copy the rest of the source cache line to the destination cache line.

6. Go back to the first stage in the sequence, using the next cache line for the source and destination until the transfer is completed.

7. At some later stage, when the destination cache line is reused by another read, the destination cache line will be flushed back to memory.

This sequence is fairly efficient but involves three bus transactions, since the source data is read, the destination data is read unnecessarily, and the destination data is written. There is also a delay between the bus transactions while the cache line is copied, and the copy cycles through the whole cache, overwriting any preexisting data.

The Cache-Aligned Block Copy Problem

There is a well-known cache-busting problem with direct-mapped caches. It occurs when the buffers are aligned in such a way that the source and destination addresses are separated by an exact multiple of the cache size. Both the source and destination use the same cache line, and a software loop doing 4-byte loads and stores with a 32-byte cache line would cause eight read misses and eight write misses for each cache line copied, instead of two read misses and one write miss. This case is desperately inefficient and can be caused by simplistic coding like that shown in Figure 11-2.

Figure 11-2. Cache-Busting Block-Copy Code
#define BUFSIZE 0x10000                /* 64 Kbytes matches SS2 cache size */

char source[BUFSIZE], destination[BUFSIZE];
int i;

for (i = 0; i < BUFSIZE; i++)
        destination[i] = source[i];

The compiler will allocate both arrays adjacently in memory, so they will be aligned and the copy will run very slowly. The library bcopy routine unrolls the loop to read a whole cache line and then write it, which avoids the problem.
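
The following sketch shows the idea behind that unrolling; it is not the libc source, and it assumes 8-byte-aligned buffers whose length is a multiple of the 32-byte line size. Because all four reads of a line complete before any of the writes, each line is fetched once even when the source and destination collide in a direct-mapped cache.

/* Copy one 32-byte cache line per iteration: read the whole source
 * line into registers, then write the whole destination line. */
void line_copy(double *dst, const double *src, unsigned long bytes)
{
    unsigned long lines = bytes / 32;

    while (lines-- > 0) {
        double d0 = src[0], d1 = src[1], d2 = src[2], d3 = src[3];

        dst[0] = d0;
        dst[1] = d1;
        dst[2] = d2;
        dst[3] = d3;
        src += 4;
        dst += 4;
    }
}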

Block Copy Acceleration Hardware

The SPARCserver 400 series machines, the SuperSPARC SuperCache controller, and the UltraSPARC Visual Instruction Set (VIS) extensions implement hardware block copy acceleration. For SuperSPARC, block copy is controlled by commands sent to special cache controller circuitry. The commands use privileged instructions (Address Space Identifier, or ASI, loads and stores) that cannot be used by normal programs but are used by the kernel to control the memory management unit and cache controller in all SPARC machines. The SPARCserver 400 and SuperCache have extra hardware that uses a special cache line buffer within the cache controller, and they use special ASI load and store addresses to control the buffer. A single ASI load causes a complete line load into the buffer from memory; a single ASI write causes a write of the buffer to memory. The data never enters the main cache, so none of the existing cached data is overwritten. The ideal pair of bus transactions occur back-to-back with minimal delays between occurrences and use all the available memory bandwidth. An extra ASI store that writes values into the buffer is defined, so that block zero can be implemented by writing zero to the buffer and performing block writes of zero at full memory speed without also filling the cache with zeroes. Physical addresses are used, so the kernel has to look up the virtual-to-physical address translation before it uses the ASI commands. The size of the block transfer is examined to see if the setup overhead is worth the cost, and small copies are done in software.

The VIS instruction set implemented by the UltraSPARC processor includes block copy instructions for use by both kernel- and user-mode block copy. It takes virtual addresses and loads or stores 64 bytes to or from the floating-point register set, bypassing the caches but remaining consistent with any data that happens to be in the cache already. This is the most efficient mechanism possible; it has very low setup overhead and is used by both bcopy and memcpy on UltraSPARC systems.
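
An ordinary user program picks up this acceleration simply by calling the library routine. The fragment below is a hypothetical example of copying one page through memcpy; memalign and getpagesize are used here only to obtain a page-sized, page-aligned buffer.

#include <string.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy one page through the dynamically linked memcpy(3C); on
 * UltraSPARC systems the library routine uses the VIS block-move
 * instructions, so the copy does not displace useful cache data. */
int copy_page_example(void)
{
    size_t pagesize = (size_t)getpagesize();
    char *src = memalign(pagesize, pagesize);
    char *dst = memalign(pagesize, pagesize);

    if (src == NULL || dst == NULL)
        return -1;
    memset(src, 0xa5, pagesize);        /* fill the source page */
    memcpy(dst, src, pagesize);         /* accelerated block copy */
    free(src);
    free(dst);
    return 0;
}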

Memory Management Unit Designs

Unlike many other processor architectures, the memory management unit is not specified as part of the SPARC architecture. This permits great flexibility of implementation, and over the years several very different MMU designs have been implemented. These designs vary depending upon the cost, level of integration, and functionality required for the system design.

The Sun-4 MMU: sun4, sun4c, sun4e Kernel Architectures

Older Sun machines use a “Sun-4™” hardware memory management unit that acts as a large cache for memory translations. It is much larger than the MMU translation cache found on more recent systems, but the entries in the cache, known as PMEGs, are larger and take longer to load. A PMEG is a Page Map Entry Group, a contiguous chunk of virtual memory made up of 32 or 64 physical 8-Kbyte or 4-Kbyte page translations. A SPARCstation 1 has a cache containing 128 PMEGs, so a total of 32 Mbytes of virtual memory can be cached. Note that a program can still use 500 Mbytes or more of virtual memory on these machines; it is the size of the translation cache, which affects performance, that varies. The number of PMEGs on each type of Sun machine is shown in Table 11-5.

Table 11-5. Sun-4 MMU Characteristics
Processor Type       Page Size    Pages/PMEG    PMEGs    Total VM    Contexts
SS1(+), SLC, IPC     4 KB         64            128      32 MB       8
ELC                  4 KB         64            128      32 MB       8
SPARCengine 1E       8 KB         32            256      64 MB       8
IPX                  4 KB         64            256      64 MB       8
SS2                  4 KB         64            256      64 MB       16
Sun 4/110, 150       8 KB         32            256      64 MB       16
Sun 4/260, 280       8 KB         32            512      128 MB      16
SPARCsystem 300      8 KB         32            256      64 MB       16
SPARCsystem 400      8 KB         32            1024     256 MB      64

Contexts in the Sun-4 MMU

Table 11-5 shows the number of hardware contexts built into each machine. A hardware context can be thought of as a tag on each PMEG entry in the MMU that indicates which process that translation is valid for. The tag allows the MMU to keep track of the mappings for 8, 16, or 64 processes in the MMU, depending on the machine. When a context switch occurs, if the new process is already assigned to one of the hardware contexts, then some of its mappings may still be in the MMU and a very fast context switch can take place. For up to the number of hardware contexts available, this scheme is more efficient than that for a more conventional TLB-based MMU. When the number of processes trying to run exceeds the number of hardware contexts, the kernel has to choose one of the hardware contexts to be reused, has to invalidate all the PMEGs for that context, and has to load some PMEGs for the new context. The context-switch time starts to degrade gradually and probably becomes worse than that in a TLB-based system when there are more than twice as many active processes as contexts.

Monitoring the Sun-4 MMU

The number of various MMU-related cache flushes can be monitored by means of vmstat -c.

% vmstat -c 
flush statistics: (totals)
usr ctx rgn seg pag par
821 960 0 0 123835 97

For Solaris 2, the required information can be obtained from the kernel statistics interface, with the undocumented netstat -k option to dump out the virtual memory hardware address translation statistics (the vmhatstat section).

vmhatstat: 
vh_ctxfree 650 vh_ctxstealclean 274 vh_ctxstealflush 209 vh_ctxmappings
4507
vh_pmgallocfree 4298 vh_pmgallocsteal 9 vh_pmgmap 0 vh_pmgldfree 3951
vh_pmgldnoctx 885 vh_pmgldcleanctx 182 vh_pmgldflush 0 vh_pmgldnomap 0
vh_faultmap 0 vh_faultload 711 vh_faultinhw 27667 vh_faultnopmg 3193
vmhatstat:
vh_faultctx 388 vh_smgfree 0 vh_smgnoctx 0 vh_smgcleanctx 0
vh_smgflush 0 vh_pmgallochas 123790 vh_pmsalloc 4 vh_pmsfree 0
vh_pmsallocfail 0

This output is very cryptic, and I haven't been able to work out any rules for which variables might indicate a problem. The Sun-4 MMU is uninteresting because the systems that use it are obsolete.

The SPARC Reference MMU: sun4m and sun4d Kernel Architectures

Many Sun machines use the SPARC Reference MMU, which has an architecture that is similar to that of many other MMUs in the rest of the industry. Table 11-6 lists the characteristics.

Table 11-6. SPARC Reference MMU Characteristics
Processor Types                                          Page Sizes    TLB Entries      Contexts    Mappable
Cypress SPARC/MBus chip set, e.g., SPARCserver 600
  -120 and -140 & Tadpole SPARCbook                      4 KB          64               4096        256 KB
SuperSPARC, e.g., SPARCserver 600 -41, SPARCstation
  10 and 20 models 30 to 61, SPARCserver 1000 and
  SC2000 to 60 MHz                                       4 KB          64               65536       256 KB
SuperSPARC II, e.g., SPARCstation 20 model 71,
  SPARCserver 1000E and 2000E - 85 MHz                   4 KB          16 I and 64 D    65536       256 KB
Fujitsu SPARClite embedded CPU                           4 KB          32               256         128 KB
microSPARC, e.g., SPARCclassic and SPARCstation LX       4 KB          32               64          128 KB
microSPARC II, e.g., SPARCstation Voyager, SS4, SS5      4 KB          64               256         256 KB

A detailed description of the hardware involved can be found in Multiprocessor System Architectures, by Ben Catanzaro. Datasheets are available at http://www.sun.com/microelectronics.

There are four common implementations: the Cypress uniprocessor 604 and multiprocessor 605 MMU chips, the MMU that is integrated into the SuperSPARC chip, the Fujitsu SPARClite, and the highly integrated microSPARC.

Unlike the case with the Sun-4 MMU, there is a small, fully associative cache for address translations (a translation lookaside buffer, or TLB), which typically has 64 entries that map one contiguous area of virtual memory each. These areas are usually a single 4-Kbyte page, but larger segments are used for mapping the SX frame buffer. This process requires contiguous and aligned physical memory for each mapping, which is hard to allocate except for special cases. Each of the 64 entries has a tag that indicates the context to which the entry belongs. This means that the MMU does not have to be flushed on a context switch. The tag is 12 bits on the Cypress/Ross MMU and 16 bits on the SuperSPARC MMU, giving rise to a much larger number of hardware contexts than in the Sun-4 MMU, so that MMU performance is not a problem when very large numbers of users or processes are present. The total mappable memory is the page size multiplied by the number of TLB entries. When this size is exceeded, the CPU will stall while TLB entries are replaced. This control is done completely in hardware, but it takes several cycles to load the new TLB entry.

The SPARC Reference MMU Table-Walk Operation

The primary difference of the SPARC Reference MMU from the Sun-4 MMU is that TLB entries are loaded automatically by table-walking hardware in the MMU. The CPU stalls for a few cycles, waiting for the MMU, but unlike many other TLB-based MMUs or the Sun-4 MMU, the CPU does not trap to use software to reload the entries itself. The kernel builds in memory a table that contains all the valid virtual memory mappings and loads the address of the table into the MMU once at boot time. The MMU then does a table walk by indexing and following linked lists to find the right page translation to load into the TLB, as shown in Figure 11-3.

Figure 11-3. SPARC Reference MMU Table Walk


The table walk is optimized by the MMU hardware, which keeps the last accessed context, region, and segment values in registers, so that the only operation needed is to index into the page table with the address supplied and load a page table entry. For the larger page sizes, the table walk stops with a special PTE at the region or segment level.
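
A minimal sketch of that walk is shown below. It models each table as a plain array of pointers and ignores the real page table descriptor and PTE encodings; the index widths (256 regions of 16 Mbytes, 64 segments of 256 Kbytes, 64 pages of 4 Kbytes) follow the SPARC Reference MMU layout, and the type and function names are invented for illustration.

typedef struct level3 { unsigned long pte[64]; }  level3_t;  /* 64 x 4-KB pages      */
typedef struct level2 { level3_t *seg[64]; }      level2_t;  /* 64 x 256-KB segments */
typedef struct level1 { level2_t *region[256]; }  level1_t;  /* 256 x 16-MB regions  */

/* What the table-walk hardware does on a TLB miss, ignoring the early
 * exit for region- or segment-level PTEs that map larger areas. */
unsigned long srmmu_walk(level1_t **context_table, unsigned ctx, unsigned long va)
{
    unsigned r = (va >> 24) & 0xff;     /* region index within the context */
    unsigned s = (va >> 18) & 0x3f;     /* segment index within the region */
    unsigned p = (va >> 12) & 0x3f;     /* page index within the segment   */

    level1_t *l1 = context_table[ctx];  /* root pointer for this context   */
    level2_t *l2 = l1->region[r];
    level3_t *l3 = l2->seg[s];

    return l3->pte[p];                  /* translation loaded into the TLB */
}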

The Sun-4 MMU-based systems can cache sufficient virtual memory translations to run programs many megabytes in size with no MMU reloads. Exceeding the MMU limits results in a large overhead. The SPARC Reference MMU caches only 64 pages of 4 Kbytes at a time in normal use, for a total of 256 Kbytes of simultaneously mapped virtual memory. The SRMMU is reloading continuously as the CPU uses more than this small set of pages, but it has an exceptionally fast reload, so there is a low overhead.
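
The reload behavior is easy to provoke with a hypothetical loop like the one below, which touches one byte per page across a buffer. While the buffer fits inside the 256 Kbytes that 64 TLB entries can map, the loop runs with no reloads; beyond that size every access triggers the hardware table walk, which is fast on the SRMMU but far more expensive on the Sun-4 MMU once PMEGs run out.

/* Touch one byte in each 4-KB page of the buffer. With a 64-entry TLB
 * this runs reload-free up to a 256-KB buffer and causes roughly one
 * table walk per access beyond that. */
long touch_pages(const char *buf, unsigned long bytes)
{
    long sum = 0;
    unsigned long off;

    for (off = 0; off < bytes; off += 4096)
        sum += buf[off];
    return sum;
}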

The UltraSPARC MMU Architecture

The SPARC Version 9 UltraSPARC architecture is a compatible, pure superset of previous SPARC designs in user mode but is very different from previous SPARC designs when in kernel mode. The MMU architecture is implemented partly in hardware and partly in software; the table-walk operation is performed by a nested fast trap in software. This trap can occur even during other trap routines since SPARC V9 defines several sets of registers to support nested traps efficiently. The flexible MMU support is designed to handle both 32-bit and 64-bit address spaces efficiently, and different sizes from the SRMMU were chosen for the region, segment, and page levels, as shown in Figure 11-4. UltraSPARC has a completely separate kernel context, which is mapped by means of a 4-Gbyte context TLB entry for code and data. In Solaris 2.6, intimate shared memory (ISM) is implemented with 4-Mbyte regions. These optimizations greatly reduce the number of TLB miss traps that occur in normal operation.

Figure 11-4. UltraSPARC MMU Table Walk


In the vmhatstat structure, the kernel maintains detailed statistics on the MMU operation; you can read them by using the SE toolkit, or you can dump them out by using netstat -k, as shown in Figure 11-5.

Figure 11-5. Example UltraSPARC MMU Statistics
% netstat -k | more 
...
vmhatstat:
vh_ctxfree 98546 vh_ctxdirty 12 vh_ctxsteal 21 vh_tteload 20329401 vh_hblk_hit
20220550
vh_hblk8_nalloc 1957 vh_hblk8_dalloc 28672 vh_hblk8_dfree 23962 vh_hblk1_nalloc 308
vh_hblk1_dalloc 5687 vh_hblk1_dfree 4536 vh_hblk8_startup_use 302
vh_pgcolor_conflict 85883 vh_uncache_conflict 30 vh_unload_conflict 4
vh_mlist_enter 0 vh_mlist_exit 0 vh_pagesync 117242342 vh_pagesync_invalid 0
vh_itlb_misses 0 vh_dtlb_misses 0 vh_utsb_misses 0 vh_ktsb_misses 0
vh_tsb_hits 0 vh_umod_faults 0 vh_kmod_faults 0 vh_slow_tsbmiss 2458310
vh_pagefaults 0 vh_uhash_searches 0 vh_uhash_links 0 vh_khash_searches 0
vh_khash_links 0 vh_steal_count 0 vh_kernel_xcalls 3141911 vh_user_xcalls 3882838


Early SPARC System Architectures

The earliest SPARC systems are relatively simple. In this section, I group them into six generations of system designs. I have included benchmark results for early systems; the benchmarks are now obsolete and may be hard to find. More recent systems have benchmark results that can be obtained from the http://www.specbench.org web site.

SPARC Uniprocessor VMEbus Systems

At the time SPARC was introduced, it was positioned at the high end of the product line. These systems were built with many components on large-format VMEbus cards. Each member of this family implemented different variations on a common theme, with a single CPU, 128-Kbyte cache to hold both data and instructions (apart from the Sun-4/100), Sun-4 MMU, single large memory system, some on-board I/O devices, and a single VMEbus for expansion. These entities formed the “sun4” kernel architecture, and a single version of the kernel code could run on any of these systems. The Sun-4 range is now obsolete, and relatively small volumes were produced; their many variations are not discussed in great detail, but Table 11-7 summarizes their benchmark results.

Table 11-7. SPEC89 Benchmark Results for SPARC Uniprocessor VMEbus Systems
Machine                            MHz     SPECmark89    Compiler
Sun4/110 and Sun4/150              14      6.1 (est)     SunPro 1.0
Sun4/260 and Sun4/280              16.6    8.1           SunPro 1.0
Sun4/330, Sun4/370 and Sun4/390    25      13.4          SunPro 1.0
Sun4/470 and Sun4/490              33      19.4          SunPro 1.0

The First Generation of SPARC Uniprocessor Desktop Systems

This family of systems conforms to a set of kernel-to-hardware interfaces known as the “sun4c” kernel architecture. The key design goal was to keep the cost down, so all main memory accesses and I/O were carried by a single 32-bit SBus. A small set of very highly integrated chips was used to implement these machines, so they all share a great many architectural details. Compared to the earlier Sun-4 machines, they have a smaller 64-Kbyte cache, a simplified version of the Sun-4 MMU, and a smaller main memory system. The I/O bus changes from VME to SBus, and unlike previous systems, the SBus is also used to carry all the CPU-to-memory system traffic. The CPU boards are much smaller than before, despite the inclusion of more I/O options as standard; see Figure 11-6.

Figure 11-6. The SPARCstation 1, 1+, IPC, SLC, 2, IPX, and ELC Family System Architecture


In earlier machines (the SS1, 1+, IPC, and SLC), there was a single clock rate for the whole system, and main memory could provide data only at half-speed, so there was a wait cycle on the SBus between each cycle of data transfer from memory. The SBus DMA interface on these machines performed only single-word transfers on the SBus, and the overhead of acquiring the bus and sending the address for every word severely limited DMA transfer rates.

In later machines (the SS2, IPX, ELC), the CPU Integer Unit (IU), Floating-Point Unit (FPU), Memory Management Unit (MMU), and cache were clocked at twice the speed of the SBus, and the memory system used faster DRAM to transfer data on every cycle. A newer version of the SBus DMA, called DMA+, transferred up to four words at a time on the SBus, for much better performance.

An upgrade for the SPARCstation IPX and SPARCstation 2, called the PowerUP, is available from Weitek. It consists of a plug-in replacement CPU chip that runs at 80 MHz internally, using a 16-Kbyte instruction cache and an 8-Kbyte data cache. It uses the external 64-Kbyte cache as a second level. Generally, the rest of the system hardware and software doesn't need to know that anything is different, so it is quite compatible and substantially faster.

I have had to provide two sets of benchmark results for these machines, shown in Table 11-8 and Table 11-9, because published results for the early machines ensued from the use of SPEC89 only. Later published results reflected the use of SPEC92 as well.

Table 11-8. SPEC89 Benchmark Results for First-Generation SPARC Desktops
Machine                          MHz    SPECmark89    Compiler
SPARCstation 1                   20     8.8           SunPro 1.0 [1]
SPARCstation SLC                 20     8.8           SunPro 1.0
SPARCstation 1+ (old compiler)   25     11.8          SunPro 1.0
SPARCstation 1+ (new compiler)   25     13.5          SunPro 2.0 [2]
SPARCstation IPC                 25     13.5          SunPro 2.0
SPARCstation ELC                 33     20.3          SunPro 2.0
SPARCstation IPX                 40     24.4          SunPro 2.0
SPARCstation 2                   40     25.0          SunPro 2.0

[1] The SunPro 1.0 code generator was used by C 1.1 and F77 1.4.

[2] The SunPro 2.0 code generator was used by C 2.0 and F77 2.0

Table 11-9. SPEC92 Benchmark Results for First-Generation SPARC Desktops
Machine                    MHz    SPECint92    Compiler      SPECfp92    Compiler
SPARCstation 1             20     N/A          -             N/A         -
SPARCstation SLC           20     N/A          -             N/A         -
SPARCstation 1+            25     13.8         SunPro 2.0    11.1        SunPro 2.0
SPARCstation IPC           25     13.8         SunPro 2.0    11.1        SunPro 2.0
SPARCstation ELC           33     18.2         SunPro 2.0    17.9        SunPro 2.0
SPARCstation IPX           40     21.8         SunPro 2.0    21.5        SunPro 2.0
SPARCstation 2             40     21.8         SunPro 2.0    22.7        SunPro 2.0
SPARCstation 2 PowerUP     80     32.2         SunPro 2.0    31.1        SunPro 2.0

Second-Generation Uniprocessor Desktop SPARC Systems

The design goal was to provide performance comparable to that of the SPARCstation 2, but at the lowest possible cost, using the highest levels of integration. The resulting pair of machines were the SPARCclassic™ and the SPARCstation LX. A large number of chips from the previous designs were combined, and the highly integrated microSPARC CPU was used. microSPARC has a much smaller cache than does the SPARCstation 2 but compensates with a higher clock frequency and a faster memory system to get slightly higher overall performance. The 6-Kbyte cache is split into a 4-Kbyte instruction cache and a 2-Kbyte data cache. Unlike previous designs, the sun4m kernel architecture is used along with a SPARC Reference Memory Management Unit (SRMMU). The memory bus is 64 bits wide rather than 32, and main memory traffic does not use the same data path as SBus I/O traffic, which improves performance for both memory-intensive and I/O-intensive applications. The SBus DMA controller is integrated along with SCSI, Ethernet, and high-speed parallel ports into a single chip. The difference between the two machines is that the SPARCstation LX adds a GX graphics accelerator, ISDN interface, and CD-quality audio to the basic SPARCclassic design. Figure 11-7 illustrates the architecture; Table 11-10 summarizes the benchmark results.

Figure 11-7. The SPARCstation LX and SPARCclassic Family System Architecture


Table 11-10. SPEC Benchmark Results for Second-Generation SPARC Desktops
Machine            MHz    SPECint92    Compiler      SPECfp92    Compiler
SPARCclassic       50     26.3         SunPro 3.0    20.9        Apogee 0.82
SPARCstation LX    50     26.3         SunPro 3.0    20.9        Apogee 0.82

Third-Generation Uniprocessor Desktop SPARC Systems

Following the SPARCclassic and the SPARCstation LX, a much faster version of the microSPARC processor, called microSPARC II, was developed and is used in the SPARCstation 4, SPARCstation 5, and SPARCstation™ Voyager™. The main differences are that microSPARC II has a much larger cache, runs at higher clock rates, and has a special graphics interface on the memory bus to augment the usual SBus graphics cards. As in the original microSPARC, the sun4m kernel architecture is used along with the SPARC Reference MMU. The SPARCstation 4 came later than the SPARCstation 5 and is a heavily cost-reduced design. Most of the cost was saved in the packaging restrictions, and architecturally it is the same as a SPARCstation 5. The SPARCstation Voyager is a nomadic system, and it has an integrated driver for a large LCD flat-screen display, power management hardware, and a PCMCIA bus interface via the SBus. The speed of the SBus is either one-third, one-fourth, or one-fifth of the processor speed since the SBus standard specifies a maximum of 25 MHz. These speeds give rise to some odd SBus rates at the clock rates used in the various versions of the SPARCstation 5. The final version, the SPARCstation 5/170, uses a new CPU design called TurboSPARC. It is derived from microSPARC II but has a 256-Kbyte external, level-two cache as well as a much higher clock rate, at 170 MHz. Figure 11-8 illustrates the architecture; Table 11-11 summarizes the benchmark results.

Figure 11-8. The SPARCstation Voyager and SPARCstation 4 and 5 Family System Architecture


Table 11-11. SPEC Benchmark Results for Third-Generation SPARC Desktops
Machine                      MHz    SPECint92    Compiler        SPECfp92    Compiler
SPARCstation Voyager         60     47.5         Apogee 2.951    40.3        Apogee 2.951
SPARCstation 4/70, 5/70      70     57.0         Apogee 2.951    47.3        Apogee 2.951
SPARCstation 4/85, 5/85      85     64.1         Apogee 2.951    54.6        Apogee 2.951
SPARCstation 4/110, 5/110    110    N/A          -               N/A         -
SPARCstation 5/170           170    N/A          -               N/A         -

Entry-Level Multiprocessor-Capable Desktop

The SPARCstation 10 and SPARCstation 20 are often used as uniprocessors, particularly since SunOS 4 is not supported on multiprocessor configurations. These machines are multiprocessor capable since the SPARC modules can be upgraded and a second module can be added. The diagram shown in Figure 11-9 is rather complex, but later sections will refer to this diagram as the details are explained. The entry-level models do not include the 1-Mbyte SuperCache, so the processor is directly connected to the MBus and takes its clock signal from the bus. The SX option shown in the diagram is supported only on the SPARCstation 10SX and all SPARCstation 20 models. These models also have a 64-bit SBus interface, although it is not supported until Solaris 2.4. Table 11-12 summarizes the benchmarks for these models.

Figure 11-9. SPARCstation 10 Model 30, 40, and SPARCstation 20 Model 50 Organization


Table 11-12. SPEC Benchmark Results for Entry-Level MP-Capable Desktops
Machine                     MHz    SPECint92    Compiler      SPECfp92    Compiler
SPARCstation 10 Model 30    36     45.2         SunPro 3.0α   54.0        Apogee 1.059
SPARCstation 10 Model 40    40     50.2         SunPro 3.0α   60.2        Apogee 1.059
SPARCstation 20 Model 50    50     69.2         Apogee 2.3    78.8        Apogee 2.3

Multiprocessor-Capable Desktop and Server

The SPARCserver 600 was the first Sun machine to use a multiprocessor-capable architecture. It has a VMEbus interface coming off the SBus, as shown in Figure 11-10. It also uses a different memory controller and SIMM type that gives more RAM capacity by using extra boards; there is no SX interface or parallel port. The SPARCstation 10 and 20 are as described in the previous section except for the addition of a SuperCache controller and 1 Mbyte of cache RAM. There are now two clocks in the machine, the SPARC module clock and the system clock. In the performance table, both are shown as module/system. Benchmarks are summarized in Table 11-13.

Figure 11-10. SPARCsystem 10, 20, and 600 Model 41, 51, and 61 Organization


Table 11-13. SPEC Benchmark Results for High-end MP-Capable Desktops and Servers
Machine                     MHz         SPECint92    Compiler      SPECfp92    Compiler
SPARCserver 600 Model 41    40.33/40    53.2         Sun 3.0α      67.8        Apogee 1.059
SPARCstation 10 Model 41    40.33/40    53.2         Sun 3.0α      67.8        Apogee 1.059
SPARCstation 10 Model 51    50/40       65.0         Sun 3.0α      83.0        Apogee 2.3
SPARCstation 20 Model 51    50/40       73.6         Apogee 2.3    84.8        Apogee 2.3
SPARCstation 20 Model 61    60/50       88.9         Apogee 2.3    102.8       Apogee 2.3
SPARCstation 20 Model 71    75/50       N/A          -             N/A         -

Adding More Processors

In the SPARCstation 10 Model 402MP and the SPARCstation 20 Model 502MP, two SuperSPARC+ processors are directly connected to the memory, as shown in Figure 11-11. If one processor is running a program that doesn't fit well in the on-chip caches and it makes heavy use of main memory, then the second processor will have to wait longer for its memory accesses to complete and will not run at full speed. This diagram has been simplified to show only the SPARC modules and the memory system; the symbol “$” is used, as is common, as an abbreviation for the word “cache” (cash).

Figure 11-11. SPARCstation 10 Model 402MP and SPARCstation 20 Model 502MP Cache Organization



In the Model 412MP, Model 512MP, Model 612MP, Model 712MP, and Model 514MP, each processor has its own SuperCache, as shown in Figure 11-12. This reduces the number of references to main memory from each processor so that there is less contention for the available memory bandwidth. The 60 MHz processors run hotter than the 50 MHz ones, and cooling limits in the package prevent a Model 614MP from being configured. When the 50 MHz parts are used in the SPARCstation 20, it automatically senses and reduces its MBus clock rate from 50 MHz to 40 MHz; this technique maintains the clock rate difference required for synchronization, but it is unfortunate that the system that needs the fastest possible MBus doesn't get it.

Figure 11-12. The SPARCsystem 10, 20, and 600 Model 514MP Cache Organization

SuperSPARC-Based Multiprocessor Hardware

Several classes of multiprocessor machines are based on the SPARC architecture: 1- to 4-processor, MBus-based systems; high-end systems with up to 8, 20, or 64 SuperSPARC processors; 1- to 4-processor UltraSPARC-based systems; and high-end UltraSPARC servers with up to 30 and up to 64 processors. Their system architectures are quite different and all are quite complex. I'll start by discussing bus architectures in general, then look at the implementations of systems based on SuperSPARC. Then, I'll describe the bus architectures and implementations used by UltraSPARC.

Bus Architectures Overview

There are two things to consider about bus performance. The peak data rate is easily quoted, but the ability of the devices on the bus to source or sink data at that rate for more than a few cycles is the real limit to performance. The second consideration is whether the bus protocol includes cycles that do not transfer data, thus reducing the sustained data throughput.

Older buses like VMEbus usually transfer one word at a time, so each bus cycle includes the overhead of deciding which device will access the bus next (arbitration) as well as setting up the address and transferring the data. This procedure is rather inefficient, so more recent buses like SBus, MBus, and UPA transfer data in blocks. Arbitration can take place once per block; then, a single address is set up and multiple cycles of data are transferred. The protocol gives better throughput if more data is transferred in each bus transaction. For example, SPARCserver 600MP systems are optimized for a standard transaction size of 32 bytes by providing 32-byte buffers in all the devices that access the bus and by using a 32-byte cache line. The SPARCcenter 2000 and UltraSPARC-based systems are optimized for 64-byte transactions, and UltraSPARC overlaps the address setup and arbitration to occur in parallel with data transfers.
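As a rough illustration of why block transfers help, the following Python sketch compares sustained throughput for a word-at-a-time protocol against a 32-byte block protocol. The cycle counts are assumptions chosen only to show the shape of the trade-off, not measured VMEbus or SBus timings.

# Hypothetical cycle counts, chosen only to illustrate the trade-off;
# they are not the real VMEbus/SBus/MBus timings.
def sustained_bandwidth(bytes_per_beat, beats_per_transaction,
                        overhead_cycles, clock_hz):
    """Bytes/s actually delivered when every transaction pays
    arbitration + address overhead before its data beats."""
    total_cycles = overhead_cycles + beats_per_transaction
    bytes_per_transaction = bytes_per_beat * beats_per_transaction
    transactions_per_second = clock_hz / total_cycles
    return bytes_per_transaction * transactions_per_second

clock = 40e6  # 40 MHz bus clock

# Word-at-a-time style: 4 bytes per transaction, 3 cycles of
# arbitration/address overhead per word (assumed).
single = sustained_bandwidth(4, 1, 3, clock)

# Block transfer style: a 32-byte transaction as 4 x 8-byte beats,
# with the same 3 overhead cycles paid once per block (assumed).
block = sustained_bandwidth(8, 4, 3, clock)

print(f"word-at-a-time : {single / 1e6:6.1f} Mbytes/s")
print(f"32-byte blocks : {block / 1e6:6.1f} Mbytes/s")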

MP Cache Issues

In systems that have more than one cache on the bus, a problem arises when the same data is stored in more than one cache and the data is modified. A cache coherency protocol and special cache tag information are needed to keep track of the data. The basic solution is for all the caches to include logic that watches the transactions on the bus (known as snooping the bus) and looks for transactions that use data that is being held by that cache. The I/O subsystem on Sun's multiprocessor machines has its own MMU and cache, so that full bus snooping support is provided for DVMA[3] I/O transfers. The coherency protocol uses invalidate transactions that are sent from the cache that owns the data when the data is modified. This technique invalidates any other copies of that data in the rest of the system. When a cache tries to read some data, the data is provided by the owner of that data, which may not be the memory system, so cache-to-cache transfers occur.

[3] DVMA stands for Direct Virtual Memory Access. It is used by intelligent I/O devices that write data directly into memory, using virtual addresses, for example, the disk and network interfaces.
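The sketch below is a deliberately simplified, assumption-laden model of a write-invalidate snooping protocol of the general kind just described (two cache-line states only, no pending transactions); the real MBus and XBus coherency protocols have more states and do all of this in hardware.

# Simplified write-invalidate snoop sketch (assumed two-state behaviour:
# 'S' = shared clean copy, 'M' = modified/owned).
class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}          # line address -> value

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, requester, addr):
        # Snoop: the current owner (if any) supplies the data
        # cache-to-cache and drops to shared.
        for c in self.caches:
            if c is not requester and c.lines.get(addr, (None,))[0] == 'M':
                state, value = c.lines[addr]
                c.lines[addr] = ('S', value)
                self.memory[addr] = value     # write back (a simplification)
                return value
        return self.memory.get(addr, 0)

    def invalidate(self, requester, addr):
        # Broadcast invalidate: every other copy is discarded.
        for c in self.caches:
            if c is not requester:
                c.lines.pop(addr, None)

class Cache:
    def __init__(self, bus):
        self.lines = {}           # addr -> (state, value)
        self.bus = bus
        bus.attach(self)

    def load(self, addr):
        if addr not in self.lines:
            self.lines[addr] = ('S', self.bus.read(self, addr))
        return self.lines[addr][1]

    def store(self, addr, value):
        self.bus.invalidate(self, addr)
        self.lines[addr] = ('M', value)

bus = Bus()
c0, c1 = Cache(bus), Cache(bus)
c0.store(0x100, 42)         # c0 owns the line
print(c1.load(0x100))       # supplied cache-to-cache, prints 42
c1.store(0x100, 43)         # invalidates c0's copy
print(0x100 in c0.lines)    # False: the stale copy is gone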

Circuit-switched Bus Protocols

One class of bus protocols effectively opens a circuit between the source and destination of the transaction and holds on to the bus until the transaction has finished and the circuit is closed. This protocol is simple to implement, but when a transfer from a slow device like main memory to a fast device like a CPU cache (a cache read) occurs, there must be a number of wait states to let the main memory DRAM access complete in the interval between sending the address to memory and the data returning. These wait states reduce cache read throughput, and nothing else can happen while the circuit is open. The faster the CPU clock rate, the more clock cycles are wasted. On a uniprocessor, this wait time just adds to the cache miss time, but on a multiprocessor the number of CPUs that a bus can handle is drastically reduced by the wait time. Note that a fast device like a cache can write data with no delays to the memory system write buffer. MBus uses this type of protocol and is suitable for up to four CPUs.
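A rough occupancy model shows why wait states limit the number of CPUs a circuit-switched bus can support: every cache read holds the bus for the DRAM wait states as well as the data cycles. The miss rate and cycle counts below are illustrative assumptions, not MBus figures.

# Rough occupancy model; all numbers are assumptions for illustration.
def cpus_supported(miss_rate_per_cpu, data_cycles, wait_cycles, bus_hz):
    busy_per_miss = (data_cycles + wait_cycles) / bus_hz    # seconds of bus held
    occupancy_per_cpu = miss_rate_per_cpu * busy_per_miss
    return 1.0 / occupancy_per_cpu          # CPUs that saturate the bus

bus_hz = 40e6
misses = 500_000        # assumed cache misses per second per CPU
data = 4                # cycles actually moving the cache line

for wait in (0, 6, 12):
    print(f"{wait:2d} wait cycles -> about "
          f"{cpus_supported(misses, data, wait, bus_hz):4.1f} CPUs at saturation")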

Packet-switched Bus Protocols

To make use of the wait time, a bus transaction must be split into a request packet and a response packet. This protocol is hard to implement because the response must contain some identification and a device, such as the memory system, on the bus may have to queue up additional requests coming in while it is trying to respond to the first one.
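The following minimal sketch shows the bookkeeping that a split-transaction protocol implies: requests carry a tag, the memory system queues them, and each response is matched back to its requester by that tag. It is purely illustrative and does not reflect the actual XBus packet format.

# Minimal split-transaction sketch: purely illustrative, not the XBus encoding.
from collections import deque

class Memory:
    def __init__(self):
        self.pending = deque()       # queued request packets

    def request(self, tag, addr):
        self.pending.append((tag, addr))     # the bus is released immediately

    def respond(self):
        tag, addr = self.pending.popleft()
        return tag, f"data@{addr:#x}"        # response packet carries the tag

class CPU:
    def __init__(self, name, memory):
        self.name, self.memory = name, memory
        self.outstanding = {}
        self.next_tag = 0

    def issue_read(self, addr):
        tag = (self.name, self.next_tag)
        self.next_tag += 1
        self.outstanding[tag] = addr
        self.memory.request(tag, addr)

    def complete(self, tag, data):
        addr = self.outstanding.pop(tag)
        print(f"{self.name}: line {addr:#x} filled with {data}")

mem = Memory()
cpus = {"cpu0": CPU("cpu0", mem), "cpu1": CPU("cpu1", mem)}
cpus["cpu0"].issue_read(0x1000)      # both requests are queued while the
cpus["cpu1"].issue_read(0x2000)      # bus stays free for other traffic
while mem.pending:
    tag, data = mem.respond()
    cpus[tag[0]].complete(tag, data)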

A protocol called XBus extends the basic MBus interface to implement a packet-switched protocol in the SuperCache controller chip used with SuperSPARC. This extension provides more than twice the throughput of MBus and is designed to be used in larger multiprocessor machines that have more than four CPUs on the bus. The SPARCcenter 2000 uses XBus within each CPU board and multiple, interleaved XBuses on its interboard backplane. The backplane bus is called XDBus. The SPARCserver 1000 has a single XDBus, and the SPARCcenter 2000 has a twin XDBus. The Cray SuperServer CS6400 has a quadruple XDBus running at a higher clock rate.

Table 11-14. SuperSPARC Multiprocessor Bus Characteristics
Bus Name      Clock   Peak Bandwidth  Read Throughput  Write Throughput
MBus          40 MHz  320 Mbytes/s    90 Mbytes/s      200 Mbytes/s
XBus          40 MHz  320 Mbytes/s    250 Mbytes/s     250 Mbytes/s
Single XDBus  40 MHz  320 Mbytes/s    250 Mbytes/s     250 Mbytes/s
Dual XDBus    40 MHz  640 Mbytes/s    500 Mbytes/s     500 Mbytes/s
Quad XDBus    55 MHz  1760 Mbytes/s   1375 Mbytes/s    1375 Mbytes/s

SuperSPARC XDBus Server Architecture

The SuperSPARC server architecture was the first Sun product range to address the high end of the server marketplace. Two packages deliver the architecture: one low-cost package that can sit on or beside a desk in an office environment, and one package that maximizes expansion capacity, for the data center. These two packages contain scaled versions of the same architecture, and the investment in IC designs and operating system porting and tuning is shared by both machines. The same architecture was used by Cray Research SuperServers, Inc., in their much larger CS6400 package. This division of Cray was purchased by Sun during 1996.

A key theme of the SuperSPARC server architecture is that it is built from a small number of highly integrated components that are replicated to produce the required configuration. Several of these components, including processor modules and SBus I/O cards, are identical to those used in the high-volume workstation product line, which reduces per-processor costs and provides a wide choice of I/O options. The system design that connects processors, I/O, and memory over multiple, high-speed buses is implemented in a small number of complex application-specific integrated circuits (ASICs); this implementation reduces the number of components on each board, easing manufacture, reducing cost, and increasing reliability.

Compared to an MBus system with one to four processors connected to one memory bank and one SBus, the additional throughput of the SPARCserver 1000 and SPARCcenter 2000 comes from the use of many more processors interconnected via substantially faster buses to multiple, independent memory banks and multiple SBuses, as summarized in Table 11-15.

Table 11-15. SPARCserver System Expansion Comparisons
Machine                  SS 10, 20            SS 1000              SC 2000              CS6400
SuperSPARC Clock Rate    40,50,60,75 MHz      50,60,85 MHz         50,60,85 MHz         60,85 MHz
CPU External Cache       Optional 1 MB        1 MB                 1 MB or 2 MB         2 MB
Max Number of CPUs       4                    8                    20                   64
Total Memory             32–512 MB            32–2048 MB           64–5120 MB           128 MB–16 GB
Memory Banks             1 x 512 MB           4 x 512 MB           20 x 256 MB          64 x 256 MB
SBus I/O Slots           4 at 20–25 MHz       12 at 20 MHz         40 at 20 MHz         64 at 25 MHz
Independent SBuses       1 @ 40–100 MB/s      4 @ 50 MB/s          10 @ 50 MB/s         16 @ 60 MB/s
Interconnect             MBus                 XDBus                2 x XDBus            4 x XDBus
Speed and Width          40, 50 MHz, 64 bits  40, 50 MHz, 64 bits  40, 50 MHz, 64 bits  55 MHz, 64 bits
Interconnect Throughput  100–130 MB/s         250–310 MB/s         500–620 MB/s         1500 MB/s

SuperSPARC Server Architectural Overview

The SuperSPARC server architecture is based on a small number of building blocks that are combined in three configurations to produce the SPARCserver 1000, SPARCcenter 2000, and CS6400.

SPARCserver 1000 System Implementation Overview

The design objective for the SPARCserver 1000 was to take the architecture into an office environment and to introduce a low-cost entry point into the range. This goal was achieved by use of a very compact package, about the same size as an office laser printer, that can be put on a desktop, stacked on the floor, or rack-mounted.

The SPARCserver 1000 system board contains twin CPU blocks sharing a single BootBus, a single SBus I/O block, including an integrated FSBE/S interface, and a single memory bank on one XDBus. The backplane accepts up to four system boards for a total of 8 CPUs, 16 SBus slots, and 2048 Mbytes of RAM. Figure 11-13 illustrates the configuration.

Figure 11-13. SPARCserver 1000 Configuration


SPARCcenter 2000 System Implementation Overview

Taking the three basic building blocks already described, the SPARCcenter 2000 system board contains a dual CPU block sharing a single BootBus, a single SBus I/O block, and a single memory bank on each XDBus. The entire system board contains only 19 highly integrated ASICs: 9 large 100-K gate ASICs and 10 much smaller chips. The backplane accepts up to 10 system boards for a total of 20 CPUs, 40 SBus slots, and 5120 Mbytes of RAM. The SPARCcenter 2000 uses a modified form of the Sun rack-mount server packaging that was used on the previous generation SPARCserver 690. Figure 11-14 illustrates the configuration.

Figure 11-14. SPARCcenter 2000 System Board Configuration


Interrupt Distribution Tuning

On the SPARCserver 1000 and SPARCcenter 2000, there are up to 10 independent SBuses, and there is hardware support for steering the interrupts from each SBus to a specific processor.

The algorithm used in Solaris 2.2 permanently assigns the clock interrupt to a CPU to obtain good cache hit rates in a single cache. The clock presents a relatively light and fixed load at 100 Hz, so this load does not significantly unbalance the system. To balance the load across all the other CPUs, a round-robin system is used, whereby all interrupts are directed to one CPU at a time. When the CPU takes the first interrupt, it sends a special broadcast command over the XDBus to all the SBus controllers to direct the next interrupt to the next CPU. This scheme balances the load, but when there is a heavy interrupt load from a particular device, it is less efficient from the point of view of cache hit rate.

The algorithm can be switched to a static interrupt distribution, whereby each SBus device is assigned to a different CPU. For some I/O-intensive workloads, this scheme has given better performance, and it is the default in Solaris 2.3 and later releases.

A kernel variable, do_robin, controls interrupt distribution and defaults to 1 in the sun4d kernel architecture of Solaris 2.2, and to 0 in 2.3 and later releases. If it is set to 0 in /etc/system, then the static interrupt distribution algorithm is used.
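The toy comparison below illustrates the trade-off between the two policies, using an assumed, deliberately skewed interrupt stream (one busy device) and an assumed device-to-CPU binding for the static case; only the do_robin settings themselves come from the text.

# Toy comparison of the two distribution policies, with an assumed,
# deliberately skewed interrupt stream (device 0 is the busy one).
from collections import Counter
from itertools import cycle

ncpu = 8
interrupts = [0] * 900 + [1, 2, 3] * 100       # device numbers, skewed

# Round-robin (do_robin = 1): each interrupt goes to the next CPU in turn.
rr_cpu = cycle(range(ncpu))
round_robin = Counter(next(rr_cpu) for _ in interrupts)

# Static (do_robin = 0): each device is bound to one CPU (assumed mapping).
static = Counter(dev % ncpu for dev in interrupts)

print("round-robin:", dict(round_robin))   # even spread, poor cache affinity
print("static     :", dict(static))        # device 0's CPU does most of the work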

Console Terminal and Clock Interrupt Issues

Since Solaris expects to have a single console terminal, the board to which the console is connected is designated as the master CPU during the first power-on boot. One of the CPUs on this board always takes the 100 Hz clock interrupt. If the system ever finds it has no master at boot time, then it selects the lowest numbered board that has a working CPU. It is best to keep this as board zero because the file /etc/iu.ap only has autopush stream entries for the ports on this board and the installation is hard to use on a raw stream device. The commands to manually push streams (if you find that you have no echo on the console in single-user boot mode) are shown in Figure 11-15. The first line must be typed blind with a control-J terminator. This sounds obscure, but it took me ages to work out why an SC2000 with the console on board two did this, so it is worth sharing!

Figure 11-15. How to Manually Push Streams on an Alternative Console Port
# strchg -h ldterm^J 
# strchg -h ttcompat

Using prtdiag to Show the Configuration

The system configuration can be seen on SPARCserver 1000 and SPARCcenter 2000 machines with the /usr/kvm/prtdiag or /usr/platform/sun4d/sbin/prtdiag command; see "The Openboot Device Tree — prtconf and prtdiag" on page 438. The following output shows the configuration of an 8-processor SPARCserver 1000 with 384 Mbytes of RAM and a 4-Mbyte bank of NVSIMM. The NVSIMM is incorrectly sensed as 8 Mbytes by the configuration software, but this does not cause any problems in operation. A substantial number of disk and network interfaces are configured on this system. The SCSI bus cards are a mixture of normal esp interfaces and wide SCSI isp interfaces. The network cards are a mixture of several buffered Ethernet le and an FDDI bf. To help identify each card, the vendor identity and part number are given for each one. Table 11-16 provides an example.

Table 11-16. Example Output from prtdiag for a SPARCserver 1000
% prtdiag 
System Configuration: Sun Microsystems sun4d SPARCserver 1000
System clock frequency: 40 MHz
Memory size: 392Mb
Number of XDBuses: 1
CPU Units: Frequency Cache-Size Memory Units: Group Size
A: MHz MB B: MHz MB 0: MB 1: MB 2: MB 3: MB
--------- --------- ----- ----- ----- -----
Board0: 50 1.0 50 1.0 32 32 32 32
Board1: 50 1.0 50 1.0 32 32 32 32
Board2: 50 1.0 50 1.0 32 32 32 32
Board3: 50 1.0 50 1.0 8 0 0 0
======================SBus Cards========================================
Board0: 0: dma/esp(scsi) 'SUNW,500-2015'
lebuffer/le(network) 'SUNW,500-2015'
1:
2: bf 'SUNW,501-1732'
3: QLGC,isp/sd(block) 'QLGC,ISP1000'
Board1: 0: dma/esp(scsi) 'SUNW,500-2015'
lebuffer/le(network) 'SUNW,500-2015'
1: dma/esp(scsi) 'SUNW,500-2015'
lebuffer/le(network) 'SUNW,500-2015'
2:
3:
Board2: 0: dma/esp(scsi) 'SUNW,500-2015'
lebuffer/le(network) 'SUNW,500-2015'
1:
2: QLGC,isp/sd(block) 'QLGC,ISP1000'
3:
Board3: 0: dma/esp(scsi) 'SUNW,500-2015'
lebuffer/le(network) 'SUNW,500-2015'
1:
2:
3:

No failures found in System
===========================


UltraSPARC Interconnect Architectures

The UltraSPARC I and UltraSPARC II processors have far higher bandwidth requirements than do the previous generation SuperSPARC systems. A completely new approach to interconnect technologies was developed to support them. In this section, I first describe the interconnect architectures, followed by the system implementations and packaging.

The UltraSPARC Port Architecture Switched Interconnect

Bus architectures like MBus (and many other buses in use today) use each wire to carry data and address information from every device on the bus. This use makes the wire long and makes the protocol complex. It also reduces data bandwidth because the wire has to carry address information in between the data transfers. The bandwidth required by different types of bus device is also compromised. Memory and CPU devices need a very wide bus, and I/O devices that do not need as much bandwidth have to implement a full-width bus interface, and this requirement pushes up costs. The UltraSPARC Port Architecture (UPA), shown in Figure 11-17 on page 293, is a very neat solution to all these problems.

Figure 11-17. The UltraSPARC Port Architecture (UPA) Ultra 1, 2, and 30 Implementation


In the networking arena, shared Ethernets have been replaced by switched Ethernet, with some ports on the switch at 10 Mbit/s and some at 100 Mbit/s. The move from a shared bus architecture to the UPA switch has the same benefits. Low-speed devices can use a low-cost UPA port, and all devices have their own dedicated wire to the switch, so there is less contention and shorter wire lengths. Shorter wires can be clocked faster, so the UPA clock rate of 83 MHz to 100 MHz is easier to achieve. Unlike a network switch, the UPA separates the addresses and data. Addresses are fed to a system controller so that cache coherency can be maintained, and the system controller manages a queue of requests and sequences the data switch. All ports on the UPA operate at the same clock rate but have different widths. Some master ports such as CPUs and I/O controllers can initiate a transfer, whereas slave ports such as memory and framebuffer devices only need to respond and are simpler to implement. All ports connect to 64-byte, cache-line-sized buffers in the data switch. Several external transfers may be needed to fill a buffer from a narrow port. When the buffer is full, it is switched to the destination port in a single cycle. In each UPA cycle (at 83 or 100 MHz), 16 bytes can be switched from one port's output buffer to another port's input buffer, so four cycles are used to transfer a 64-byte cache line. The next four cycles can be used to transfer a different cache line. The overall data bandwidth is 16 bytes at 100 MHz or 1.6 Gbytes/s. All data carries error correction bits, so the switch is really 144 bits wide rather than 128 bits.

Unlike the case with most other buses, there are no dead cycles between transfers and no address transfers to get in the way. All address transfers occur in parallel on separate wires. All signals are either error corrected or parity protected. There is one address transfer for every four data transfers, and the system design is really limited by the rate at which addresses can be issued and arbitrated (25 million addresses/s). Each device has its own dedicated connection to a port, so slow devices do not interfere with fast devices, and all these buses can be transferring at once. There is very little contention under heavy load, so the latency of the UPA stays low longer than bus-based alternatives.
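The arithmetic behind these figures is simple enough to check directly; the short sketch below reproduces the 1.6 Gbytes/s data path and the matching 25 million addresses/s coherency limit from the numbers quoted above.

# UPA throughput arithmetic from the figures quoted above.
upa_clock = 100e6            # Hz
bytes_per_cycle = 16         # data switched per UPA cycle
line = 64                    # bytes per cache line (4 cycles per line)

data_bw = upa_clock * bytes_per_cycle            # 1.6 Gbytes/s
address_rate = upa_clock / 4                     # one address per 4-cycle line
coherency_bw = address_rate * line               # also 1.6 Gbytes/s

print(f"data path     : {data_bw / 1e9:.1f} Gbytes/s")
print(f"address limit : {address_rate / 1e6:.0f} M addresses/s "
      f"-> {coherency_bw / 1e9:.1f} Gbytes/s")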

The Gigaplane Ultra Enterprise Server Backplane

The Ultra Enterprise Server range is a configurable, multiple-board-based system. It uses a passive backplane to keep the package and entry-level configuration costs down. The Gigaplane bus connects the boards with a 256-bit-wide data path and a separate address path. It is a remarkably efficient design that has no wasted clock cycles and approaches the limits set by the laws of physics. Gigaplane is a distributed extension of the techniques used by the UPA design.

Each address is broadcast to the entire system in one clock cycle, and the snoop results are returned in the next cycle. A 64-byte cache line is also transferred over the data path in two cycles. Again, there are no dead cycles on the bus, even when the direction of transfer is changed. Two-cycle arbitration with no dead cycles is a very aggressive implementation (some competing systems take five cycles and need a dead cycle between each phase before the bus is driven in the opposite direction). The result is a capacity of 41 million addresses/s at 82 MHz. This is a key measure for large SMP systems, known as the coherency bandwidth. The data bandwidth is the coherency bandwidth multiplied by the cache line size of 64 bytes, which is 2,624,000,000 bytes/s. Data rates of over 2500 Mbytes/s have been measured on these systems.

The Gigaplane also has some special optimizations that make memory sharing and spin locks more efficient for the E3000, E4000, E5000, and E6000.

The Gigaplane XB Crossbar

The E10000 Starfire is also a configurable, multiple-board-based system, but it has an active backplane. This means that the entry-level configuration includes the full cost of the complex circuitry that implements the crossbar interconnect. This design decision is one of the key differentiators between the two server architectures. The active backplane adds to the throughput and flexibility of the design and, unfortunately, also adds to the system cost and memory latency.

In some ways, Gigaplane XB is more like the entry-level UPA design, in that it uses point-to-point 128-bit-wide buses, all connected to ports on a central switch. The essential difference is that while UPA transfers one block of data in each clock cycle between any pair of ports, Gigaplane XB moves eight blocks of data at the same time between pairs of ports, using a 16-port crossbar switch rather than the single switch of UPA. At the same clock rate, the data bandwidth of Gigaplane XB is eight times that of the entry-level UPA-based systems, and four times that of the Gigaplane. Since it uses short point-to-point wiring, there is also more headroom in the design (it is easier to achieve high clock rates) and better fault isolation. The entire system was designed to run at 100 MHz, but it shares its CPU modules with the rest of the enterprise server range and operates at a nominal clock rate of 83 MHz with the 250 MHz module.

A challenge for Gigaplane XB is to obtain enough address bus coherency bandwidth to keep its crossbar busy. It does this by using four interleaved global address buses with two-cycle arbitration. This approach matches the cache line transfer rate, which takes four cycles, with eight occurring in parallel. It thus has four times the bandwidth of the Gigaplane, at 167 million addresses/s at 83 MHz. With 64-byte cache lines, this bandwidth works out to about 10.6 Gbytes/s. When converting to megabytes and gigabytes, take care that you are using consistent definitions as either a power of 10 or a power of 2 throughout your calculations. Exact clock rates for your system can be obtained from prtconf, as described in "The Openboot Device Tree — prtconf and prtdiag" on page 438. Table 11-17 shows exact measured rates taken from prtdiag, rather than the nominal rates, and approximate latency.
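Repeating the same calculation for the two backplanes, using the address rates quoted above and in Table 11-17, also shows how much the answer shifts depending on whether a megabyte is taken as 10^6 or 2^20 bytes.

# Coherency bandwidth -> data bandwidth for the two backplanes, using the
# address rates quoted above, and both common definitions of "Mbyte".
line = 64                                    # bytes per cache line

def data_bw(address_rate_hz):
    return address_rate_hz * line            # bytes/s

gigaplane    = data_bw(41.0e6)               # 41 M addresses/s at 82 MHz
gigaplane_xb = data_bw(166_588_714)          # 4 address buses at 83.3 MHz / 2

for name, bw in (("Gigaplane", gigaplane), ("Gigaplane XB", gigaplane_xb)):
    print(f"{name:12s}: {bw:>14,.0f} bytes/s "
          f"= {bw / 1e6:8.0f} Mbytes/s (10^6) "
          f"= {bw / 2**20:8.0f} Mbytes/s (2^20)")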

Table 11-17. UltraSPARC Multiprocessor Bus Characteristics
Bus Name      Bus Clock       CPU Clock       Addresses        Data bytes/s    Latency
UPA           84.0 MHz        168 MHz         21 MHz           1,344,000,000   170 ns
UPA           82.0 MHz        248 MHz         20.5 MHz         1,312,000,000   170 ns
UPA           100.0 MHz       200, 300 MHz    25 MHz           1,600,000,000   160 ns
Gigaplane     82 MHz          248 MHz         41 MHz           2,624,000,000   270 ns
Gigaplane XB  83,294,357 Hz   249,883,071 Hz  166,588,714 Hz   10,661,677,696  400 ns
Gigaplane XB  100 MHz         N/A             200 MHz          12,800,000,000  N/A

Main memory latency for a cache miss is about 400 ns with the Gigaplane XB. It is higher than for the Gigaplane because there is an extra level of switching circuitry to pass through. The Gigaplane XB data path is 144 bits wide, rather than 288 bits wide as for Gigaplane, so transfers take four rather than two cycles. The Gigaplane spin lock optimizations are not used by the E10000, which was designed by the Cray SuperServers team before Sun acquired their organization. In effect, when running within the bandwidth limitations of the Gigaplane, the Gigaplane is higher performance than the Gigaplane XB. The full bandwidth of the Gigaplane XB is available only in a maximum configuration. It increases as pairs of boards are added to a domain, as shown in Table 11-18. An odd number of boards does not add any bandwidth, as the odd board has nothing to communicate with. Comparing with the Gigaplane, you can see that up to 20 CPUs, Gigaplane should have a performance advantage; above 20 CPUs, Gigaplane XB has significantly higher bandwidth. As described in "The E10000 Starfire Implementation" on page 297, the E10000 also adds more I/O capability on each board, whereas Gigaplane-based systems have to reduce the number of I/O boards to fit in more CPUs. If you are interested in an E10000 for its ultimate performance, you should configure it into a single large domain. If you want flexibility in a single rack, the performance of a small domain is very good, but you need to take into account that it will be lower than a dedicated system with the same number of CPUs.

Table 11-18. Data Bandwidth Variation with Domain Size
Domain Size                   CPU Count  Data Bandwidth Mbyte/s
UPA                           1–4        1600
Gigaplane                     1 to 30    2624
Gigaplane XB 1 to 3 boards    1–12       1333
Gigaplane XB 4 to 5 boards    13–20      2666
Gigaplane XB 6 to 7 boards    21–28      4000
Gigaplane XB 8 to 9 boards    29–36      5331
Gigaplane XB 10 to 11 boards  37–44      6664
Gigaplane XB 12 to 13 boards  45–52      7997
Gigaplane XB 14 to 15 boards  53–60      9330
Gigaplane XB 16 boards        61–64      10662
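Table 11-18 can be approximated by a simple rule of thumb: each pair of boards adds roughly 1333 Mbytes/s of Gigaplane XB bandwidth, while Gigaplane stays fixed. The sketch below encodes that approximation; it compares raw bandwidth only and ignores the latency and spin-lock advantages of the Gigaplane noted above.

# Interpolating Table 11-18: roughly 1333 Mbytes/s per pair of boards.
def xb_domain_bandwidth(boards):
    return (boards // 2) * 1333        # Mbytes/s; an odd board adds nothing

gigaplane = 2624                       # Mbytes/s, fixed
for boards in (2, 4, 6, 8, 12, 16):
    print(f"{boards:2d} boards: Gigaplane XB ~{xb_domain_bandwidth(boards):5d} "
          f"Mbytes/s vs Gigaplane {gigaplane} Mbytes/s")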

UltraSPARC III Interconnect Considerations

The first details about UltraSPARC III were disclosed at the Microprocessor Forum event in October 1997. The press release, available on Sun's web site, contains the following details: the target CPU clock rate is 600 MHz, and the per-CPU bandwidth is 2.4 Gbytes/s, with performance two to three times higher than that of UltraSPARC II.

None of the above interconnects can interface to a CPU with this much bandwidth. In the meantime, with higher-performance UltraSPARC II CPU modules and excellent SMP scalability to large numbers of processors, there is plenty of room for growth in the current product range.

UltraSPARC System Implementations

There are five UltraSPARC I- and UltraSPARC II-based system implementations to discuss. The most recent and simplest is the highly integrated uniprocessor UltraSPARC IIi, a successor to the microSPARC. The initial UltraSPARC workstations are based on a simple UPA configuration. An expanded UPA configuration is used in the workgroup server systems. The Enterprise Server range uses a Gigaplane bus to link large numbers of CPUs and I/O buses. The E10000 Starfire uses an active backplane crossbar to support the very biggest configurations.

The Ultra 5 and Ultra 10

The UltraSPARC IIi provides a highly integrated design for building low-cost systems in the same way as its microSPARC predecessor. The CPU chip includes a complete memory controller, external cache controller, I/O bus controller (32-bit PCIbus this time) and a UPA slave interface for Creator graphics options. Figure 11-16 illustrates the architecture.

Figure 11-16. UltraSPARC IIi-Based, Entry-Level Ultra 5, Ultra 10 System Architecture


The Ultra 1, Ultra 2 and Ultra 30, Ultra 60

The UPA configuration of these systems is shown in Figure 11-17. The address connections are omitted, leaving just the data paths through the port. ECC adds to the real width of each port, making the sizes 72, 144, and 288 bits, to carry 64, 128, and 256 bits of data.

Memory is a problem because DRAM is slow to respond, compared to other devices. It is given a double-width port, which is loaded at half the rate (i.e., 50 MHz) from a pair of memory modules. Since a full cache line is 64 bytes, or 576 bits including ECC, a second pair of memory modules is set up to transfer immediately afterward in the dual-CPU, Ultra 2 version of this design. This interleaving reduces latency in case the other CPU wishes to access the memory port immediately. The Ultra 30 and Ultra 60 support a second UPA graphics interface for Creator or Elite graphics options. These systems benefit from the low latency of the UPA but do not come close to using up its bandwidth. The design is limited by memory bandwidth to around 300–500 Mbytes/s. Main memory latency in this configuration is 28 cycles (168 ns) for a 167 MHz CPU with 83 MHz UPA. Latency in nanoseconds is a little better with a 100 MHz UPA, but the number of clock cycles gets worse as the CPU speed increases: approximately 30 cycles at 200 MHz, 40 at 250 MHz, and 50 at 300 MHz.
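Converting the latency from nanoseconds to CPU clock cycles makes it clear why the cycle count grows with CPU speed. The sketch below uses the approximate per-configuration latencies from Table 11-17, so the cycle counts come out close to, but not exactly, the rounded figures quoted above.

# Converting main memory latency from nanoseconds into CPU clock cycles.
# Latencies are the approximate UPA figures from Table 11-17.
latency_ns = {167: 170, 200: 160, 250: 170, 300: 160}   # CPU MHz -> ns
for cpu_mhz, ns in latency_ns.items():
    cycles = ns * cpu_mhz / 1000          # ns x cycles-per-ns
    print(f"{cpu_mhz} MHz CPU, {ns} ns miss: ~{cycles:.0f} CPU cycles")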

The Ultra 30 is packaged as a PCIbus tower system, closer to a PC configuration than to the traditional Sun "pizza box" used in the Ultra 1 and Ultra 2. The Ultra 60 is a dual-CPU version of the Ultra 30. Their architecture is the same as that of the Ultra 2 but with the substitution of a PCIbus controller for the SBus controller used in previous designs. This UPA-to-PCI controller is used in many UltraSPARC systems. It has two PCIbuses: one is a high-speed, 64-bit @ 66 MHz bus with a single slot; the other is a 32/64-bit @ 33 MHz bus that can support several slots.

Larger UPA-Based Systems: The Enterprise 450

To extend the design to cope with the demands of the four-processor Enterprise 450, a more complex switch was developed with twice as many ports, as shown in Figure 11-18. The configuration includes not only four processor ports but also two independent memory ports, two graphics ports, and two I/O bus ports, each supporting dual PCI buses for a total of four. This design represents the largest balanced configuration that can be built on a single motherboard, yet it still does not saturate the throughput capabilities of the UPA switch. It retains the low latency of the entry-level designs, while providing twice the CPU power and twice the memory and I/O bandwidth.

Figure 11-18. The E450 UltraSPARC Port Architecture (UPA) Switched Interconnect


Memory is accessed four DIMMs at a time, and additional performance comes from interleaving the four banks of four DIMMs. The interleaving rules are very restrictive, and for best performance, the simple rule is to always use the same size DIMM throughout. A mixture of DIMM capacities will prevent interleaving. Two banks of identical DIMMs, with two empty banks, will be two-way interleaved. Four banks of identical DIMMs will be four-way interleaved. Any other combination prevents any interleaving. This is unlikely to be an issue on systems with less than four CPUs, but a high-end configuration that is heavily loaded should at least have two-way interleaving, and preferably four.
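These rules are literal enough to encode directly. The sketch below takes a hypothetical list of four DIMM-bank capacities (in Mbytes, zero meaning an empty bank) and returns the interleave factor the rules above would give.

# A literal encoding of the E450 interleaving rules described above.
def interleave_factor(banks):
    filled = [b for b in banks if b > 0]
    if len(filled) == 4 and len(set(filled)) == 1:
        return 4                    # four identical banks: four-way
    if len(filled) == 2 and len(set(filled)) == 1:
        return 2                    # two identical banks, two empty: two-way
    return 1                        # any other mixture: no interleaving

print(interleave_factor([256, 256, 256, 256]))   # 4
print(interleave_factor([256, 256, 0, 0]))       # 2
print(interleave_factor([256, 128, 128, 128]))   # 1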

The low latency of the E450 gives it significantly higher performance than a four-CPU Enterprise Server at substantially lower cost. The scope for expansion is limited, however, so next we will look at the multiboard Ultra Enterprise Server Architecture.

The Ultra Enterprise Server Architecture

Boards come in two types, CPU+Memory and I/O+Framebuffer. The CPU+Memory board shown in Figure 11-19 allows memory to be added to the system in proportion to the CPU power configured. Some competing designs use separate CPU and memory boards and suffer from having to trade off CPU for memory in big configurations.

Figure 11-19. The Ultra Enterprise Server CPU+Memory Board Implementation


The three different designs of separate I/O+Framebuffer board shown in Figure 11-20 allow systems to be configured for CPU-intensive or I/O-intensive workloads and also allow a mixture of I/O types (SBus, PCIbus, and SBus+Creator Framebuffer) to be configured as needed.

Figure 11-20. The Ultra Enterprise Server I/O Board Implementation


These two board types are combined into three basic packages: the four-board E3000; the eight-board E4000, which is called the E5000 when rack-mounted; and the sixteen-board E6000. The larger systems use a centerplane, with boards plugging in horizontally from both the front and back of the chassis, as shown in Figure 11-21. The E3000 has boards numbered 1, 3, 5, 7 only, the E4000 has boards numbered 0 to 7 inclusive, and the E6000 has boards numbered 0–15.

Figure 11-21. Enterprise Server Board Configuration: Side View with Numbering


The E10000 Starfire Implementation

The E10000 uses a single, large-format board type, with removable daughter cards that can configure SBus or PCIbus options for I/O. Each board includes four CPU modules, four memory banks, and two I/O buses, as shown in Figure 11-22. The packaging takes up to sixteen boards, connected as described in "The Gigaplane XB Crossbar" on page 289. More information is available from the "Ultra Enterprise 10000 System Overview White Paper" and other papers at http://www.sun.com/servers/ultra_enterprise/10000/wp.

Figure 11-22. The E10000 Starfire Implementation


The four memory banks are interleaved. Each memory bank has its own global address bus that maintains coherency with that bank on other boards.

Each board can function independently, and single boards or groups of boards can be configured into domains. (Domain memory bandwidth is described in Table 11-18 on page 291.) The system can boot multiple copies of Solaris, one per domain. The crossbar switch is configured so that data transfers between boards in one domain do not affect the performance of any other domain. There is also very good fault isolation between domains. Hardware failures are not propagated. Domains can be changed on line via dynamic reconfiguration.