[Cockcroft98] Chapter 10. Processors


Previous chapters have looked at the disk and network components of a system. This chapter covers the processor component. The details of particular processors are dealt with later. This chapter explains the measurements that each version of Solaris makes on a per-processor basis. It also discusses the trade-offs made by different multiprocessor architectures.

Monitoring Processors

This section lists the most important commands that you should know and explains where their data comes from.

CPU Control Commands — psrinfo, psradm, and psrset

psrinfo tells you which CPUs are in use and when they were last enabled or disabled. psradm actually controls the CPUs. Note that in server systems, interrupts are bound to specific CPUs to improve the cache hit rate, and clock interrupts always go to CPU 0. Even if a CPU is disabled, it still receives interrupts. The psrset command manages processor sets in Solaris 2.6.
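
For example, the following sequence lists the CPUs, takes CPU 1 offline and back online with psradm -f and psradm -n, then creates a processor set containing CPU 1 with psrset -c. The CPU numbers and timestamps are illustrative, psradm and psrset must be run as root, and psrset requires Solaris 2.6:

% psrinfo
0       on-line   since 05/12/98 11:12:28
1       on-line   since 05/12/98 11:12:30
# psradm -f 1
# psradm -n 1
# psrset -c 1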

The Load Average

The load average is the sum of the run queue length and the number of jobs currently running on CPUs. The three figures given are averages over the last 1, 5, and 15 minutes. They can be used as an average run queue length indicator to see if more CPU power is required.

% uptime 
11:40pm up 7 day(s), 3:54, 1 user, load average: 0.27, 0.22, 0.24

The Data Behind vmstat and mpstat Output

What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from?

First, let’s look at Figure 10-1 to remind ourselves what vmstat itself looks like.

Figure 10-1. Example vmstat Output
% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98
0 0 0 64724 25184 0 24 56 0 0 0 0 0 0 19 0 311 1112 356 2 4 94
0 0 0 64724 24796 0 5 38 0 0 0 0 0 0 15 0 92 325 212 0 1 99
0 0 0 64680 24584 0 12 106 0 0 0 0 0 0 41 0 574 1094 340 2 5 93
0 1 0 64632 23736 0 0 195 0 0 0 0 0 0 66 0 612 870 270 1 7 92
0 0 0 64628 22796 0 0 144 0 0 0 0 0 0 59 0 398 764 222 1 8 91
0 0 0 64620 22796 0 0 79 0 0 0 0 0 0 50 0 255 1383 136 2 18 80


The command prints the first line of data immediately, then every five seconds prints a new line that gives the average rates over the five-second interval. The first line is also the average rate over the interval that started when the system was booted, because the numbers are stored by the system as counts of the number of times each event has happened. To average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measure, there are zero counters to subtract, so you automatically get the count since boot, divided by the time since boot. The absolute counters themselves can be seen with another option to vmstat, as shown in Figure 10-2.
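
As a minimal sketch of that arithmetic in C (the function and variable names are illustrative, not taken from the vmstat source):

/* turn two samples of a cumulative event counter into an average rate */
double rate_per_sec(unsigned long now, unsigned long before, double interval)
{
        return (now - before) / interval;   /* e.g., system calls per second */
}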

Figure 10-2. Example vmstat -s Raw Counters Output
% vmstat -s 
0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
208724 total address trans. faults taken
45821 page ins
3385 page outs
61800 pages paged in
27132 pages paged out
712 total reclaims
712 reclaims from free list
0 micro (hat) faults
208724 minor (as) faults
44636 major faults
34020 copy-on-write faults
77883 zero fill page faults
9098 pages examined by the clock daemon
1 revolutions of the clock hand
27748 pages freed by the clock daemon
1333 forks
187 vforks
1589 execs
6730851 cpu context switches
12848989 device interrupts
340014 traps
28393796 system calls
285638 total name lookups (cache hits 91%)
108 toolong
159288 user cpu
123409 system cpu
15185004 idle cpu
192794 wait cpu


The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Figure 10-3 shows some output on a dual-CPU system.

Figure 10-3. Example mpstat Output
% mpstat 5 
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 4 82 17 43 0 5 1 0 182 1 1 1 97
2 1 0 3 81 17 42 0 5 2 0 181 1 1 1 97
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 2 0 39 156 106 42 0 5 21 0 30 0 2 61 37
2 0 0 0 158 106 103 5 4 8 0 1704 3 36 61 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 19 28 194 142 96 1 4 18 0 342 1 8 76 16
2 0 6 11 193 141 62 4 4 10 0 683 5 15 74 6
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 22 33 215 163 87 0 7 0 0 287 1 4 90 5
2 0 22 29 214 164 88 2 8 1 0 304 2 5 89 4


The Sources of vmstat Data

The data is read from the kernel statistics interface that is described in “The Solaris 2 “kstat” Interface” on page 387. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used. It is very lightweight—it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains customized scripts; see “vmstat.se” on page 500 and “mpvmstat.se” on page 485.
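
As a minimal sketch of how a tool like vmstat might read the global sysinfo counters through this interface (error handling omitted; link with -lkstat; the module and name strings follow the usual unix:0:sysinfo naming, which you can confirm on your own system):

#include <stdio.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int main(void)
{
        kstat_ctl_t *kc = kstat_open();     /* open the kstat interface */
        kstat_t *ksp = kstat_lookup(kc, "unix", 0, "sysinfo");
        sysinfo_t si;

        kstat_read(kc, ksp, &si);           /* copy out the raw counters */
        /* average run queue length since boot = runque / updates */
        printf("updates %lu runque %lu waiting %lu\n",
            si.updates, si.runque, si.waiting);
        kstat_close(kc);
        return 0;
}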

Process Queues

As we follow through the fields of vmstat, we see that the first one is labeled procs, r, b, w. This field is derived from the sysinfo data, and there is a single global kstat with the same form as the sysinfo structure shown in Figure 10-4.

Figure 10-4. The sysinfo Process Queue Data Structure
typedef struct sysinfo {        /* (update freq) update action          */
        ulong updates;          /* (1 sec) ++                           */
        ulong runque;           /* (1 sec) += num runnable procs        */
        ulong runocc;           /* (1 sec) ++ if num runnable procs > 0 */
        ulong swpque;           /* (1 sec) += num swapped procs         */
        ulong swpocc;           /* (1 sec) ++ if num swapped procs > 0  */
        ulong waiting;          /* (1 sec) += jobs waiting for I/O      */
} sysinfo_t;

As the comments indicate, the fields are updated once per second. The run queue counts the number of runnable processes that are not running. The extra data values, called runocc and swpocc, are displayed by sar -q and show the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered in swpque, don’t be alarmed. sar -q is strange: if the number to be displayed is zero, sar -q displays nothing at all, just white space. If you add the number of processes actually running on CPUs in the system, then you get the same measure on which the load average is based. waiting counts how many processes are waiting for a block device transfer (a disk I/O) to complete. It shows up as the b field in vmstat.

Virtual Memory Counters

The next part of vmstat lists the free swap space and memory. This information is obtained as a kstat from a single global vminfo structure, as shown in Figure 10-5.

Figure 10-5. The vminfo Memory Usage Data Structure
typedef struct vminfo {         /* (update freq) update action          */
        longlong_t freemem;     /* (1 sec) += freemem in pages          */
        longlong_t swap_resv;   /* (1 sec) += reserved swap in pages    */
        longlong_t swap_alloc;  /* (1 sec) += allocated swap in pages   */
        longlong_t swap_avail;  /* (1 sec) += unreserved swap in pages  */
        longlong_t swap_free;   /* (1 sec) += unallocated swap in pages */
} vminfo_t;

The only swap number shown by vmstat is swap_avail, which is the most important one. If this number ever goes to zero, your system will hang up and be unable to start more processes! For some strange reason, sar -r reports swap_free instead and converts the data into useless units of 512-byte blocks. The bizarre state of the sar command is one reason why we were motivated to create and develop the SE toolkit in the first place!

These measures are accumulated, so you calculate an average level by taking the values at two points in time and dividing the difference by the number of updates.
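
A minimal sketch of that calculation, assuming vminfo and sysinfo are sampled at the same two points in time (both sets of counters are updated once per second, so the sysinfo updates counter gives the number of updates):

/* average available swap in pages over a measurement interval */
double avg_swap_avail(vminfo_t *v0, vminfo_t *v1, sysinfo_t *s0, sysinfo_t *s1)
{
        return (double)(v1->swap_avail - v0->swap_avail)
             / (double)(s1->updates - s0->updates);
}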

Paging Counters

The paging counters shown in Figure 10-6 are maintained on a per-CPU basis; it’s also clear that vmstat and sar don’t show all of the available information. The states and state transitions being counted are described in detail in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Figure 10-6. The cpu_vminfo Per-CPU Paging Counters Structure
typedef struct cpu_vminfo {
        ulong pgrec;            /* page reclaims (includes page-out)    */
        ulong pgfrec;           /* page reclaims from free list         */
        ulong pgin;             /* page-ins                             */
        ulong pgpgin;           /* pages paged in                       */
        ulong pgout;            /* page-outs                            */
        ulong pgpgout;          /* pages paged out                      */
        ulong swapin;           /* swap-ins                             */
        ulong pgswapin;         /* pages swapped in                     */
        ulong swapout;          /* swap-outs                            */
        ulong pgswapout;        /* pages swapped out                    */
        ulong zfod;             /* pages zero filled on demand          */
        ulong dfree;            /* pages freed by daemon or auto        */
        ulong scan;             /* pages examined by page-out daemon    */
        ulong rev;              /* revolutions of the page daemon hand  */
        ulong hat_fault;        /* minor page faults via hat_fault()    */
        ulong as_fault;         /* minor page faults via as_fault()     */
        ulong maj_fault;        /* major page faults                    */
        ulong cow_fault;        /* copy-on-write faults                 */
        ulong prot_fault;       /* protection faults                    */
        ulong softlock;         /* faults due to software locking req   */
        ulong kernel_asflt;     /* as_fault()s in kernel addr space     */
        ulong pgrrun;           /* times pager scheduled                */
} cpu_vminfo_t;


A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn’t, gets a segmentation violation signal, and dumps a core file. Hat faults occur only on systems that have a software-managed memory management unit (sun4c and sun4u). Hat stands for hardware address translation.

I’ll skip the disk counters because vmstat is just reading the same kstat data as iostat and providing a crude count of the number of operations for the first four disks.

CPU Usage and Event Counters

Let’s remind ourselves again what vmstat looks like.

% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98


The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU structure from which these are derived is the biggest structure yet, with about 60 values. Some of them are summarized by sar, but a lot of interesting stuff here is being carefully recorded by the kernel, is read by vmstat, and is then just thrown away. Look down the comments in Figure 10-7, and at the end I’ll point out some nonobvious and interesting values. The cpu and wait states are arrays holding the four CPU states—usr/sys/idle/wait—and three wait states—io/swap/pio. Only the io wait state is implemented, so this measurement is superfluous. (I made the mpvmstat.se script print out the wait states before I realized that they were always zero.)

Figure 10-7. The cpu_sysinfo Per-CPU System Information Structure
typedef struct cpu_sysinfo {
        ulong cpu[CPU_STATES];  /* CPU utilization                      */
        ulong wait[W_STATES];   /* CPU wait time breakdown              */
        ulong bread;            /* physical block reads                 */
        ulong bwrite;           /* physical block writes (sync+async)   */
        ulong lread;            /* logical block reads                  */
        ulong lwrite;           /* logical block writes                 */
        ulong phread;           /* raw I/O reads                        */
        ulong phwrite;          /* raw I/O writes                       */
        ulong pswitch;          /* context switches                     */
        ulong trap;             /* traps                                */
        ulong intr;             /* device interrupts                    */
        ulong syscall;          /* system calls                         */
        ulong sysread;          /* read() + readv() system calls        */
        ulong syswrite;         /* write() + writev() system calls      */
        ulong sysfork;          /* forks                                */
        ulong sysvfork;         /* vforks                               */
        ulong sysexec;          /* execs                                */
        ulong readch;           /* bytes read by rdwr()                 */
        ulong writech;          /* bytes written by rdwr()              */
        ulong rcvint;           /* XXX: UNUSED                          */
        ulong xmtint;           /* XXX: UNUSED                          */
        ulong mdmint;           /* XXX: UNUSED                          */
        ulong rawch;            /* terminal input characters            */
        ulong canch;            /* chars handled in canonical mode      */
        ulong outch;            /* terminal output characters           */
        ulong msg;              /* msg count (msgrcv()+msgsnd() calls)  */
        ulong sema;             /* semaphore ops count (semop() calls)  */
        ulong namei;            /* pathname lookups                     */
        ulong ufsiget;          /* ufs_iget() calls                     */
        ulong ufsdirblk;        /* directory blocks read                */
        ulong ufsipage;         /* inodes taken with attached pages     */
        ulong ufsinopage;       /* inodes taken with no attached pages  */
        ulong inodeovf;         /* inode table overflows                */
        ulong fileovf;          /* file table overflows                 */
        ulong procovf;          /* proc table overflows                 */
        ulong intrthread;       /* interrupts as threads (below clock)  */
        ulong intrblk;          /* intrs blkd/preempted/released (swtch) */
        ulong idlethread;       /* times idle thread scheduled          */
        ulong inv_swtch;        /* involuntary context switches         */
        ulong nthreads;         /* thread_create()s                     */
        ulong cpumigrate;       /* cpu migrations by threads            */
        ulong xcalls;           /* xcalls to other cpus                 */
        ulong mutex_adenters;   /* failed mutex enters (adaptive)       */
        ulong rw_rdfails;       /* rw reader failures                   */
        ulong rw_wrfails;       /* rw writer failures                   */
        ulong modload;          /* times loadable module loaded         */
        ulong modunload;        /* times loadable module unloaded       */
        ulong bawrite;          /* physical block writes (async)        */
/* Following are gathered only under #ifdef STATISTICS in source */
        ulong rw_enters;        /* tries to acquire rw lock             */
        ulong win_uo_cnt;       /* reg window user overflows            */
        ulong win_uu_cnt;       /* reg window user underflows           */
        ulong win_so_cnt;       /* reg window system overflows          */
        ulong win_su_cnt;       /* reg window system underflows         */
        ulong win_suo_cnt;      /* reg window system user overflows     */
} cpu_sysinfo_t;


Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU wakes up another CPU by interrupting it.
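
As a sketch of how the usr/sys/wt/idl columns can be derived from two samples of this structure (the CPU_USER, CPU_KERNEL, CPU_WAIT, and CPU_IDLE indices into the cpu array are defined in sysinfo.h):

#include <sys/sysinfo.h>

/* convert two samples of the per-CPU state counters into percentages */
void cpu_percentages(cpu_sysinfo_t *s0, cpu_sysinfo_t *s1,
    double pct[CPU_STATES])
{
        ulong delta[CPU_STATES], total = 0;
        int i;

        for (i = 0; i < CPU_STATES; i++) {
                delta[i] = s1->cpu[i] - s0->cpu[i];
                total += delta[i];
        }
        for (i = 0; i < CPU_STATES; i++)    /* CPU_USER, CPU_KERNEL, ... */
                pct[i] = total ? 100.0 * delta[i] / total : 0.0;
}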

vmstat prints out 22 columns of numbers, summarizing over a hundred underlying measures (even more on a multiprocessor). It’s good to have a lot of different things summarized in one place, but the layout of vmstat (and sar) is as much a result of their long history as it is by design. Could you do better?

I’m afraid I’m going to end up plugging the SE toolkit again. It’s just so easy to get at this data and do things with it. All the structures can be read by any user, with no need to be setuid root. (This is a key advantage of Solaris 2; other Unix systems read the kernel directly, so you would have to obtain root permissions.)

If you want to customize your very own vmstat, you could either write one from scratch in C, using the kstat library as described in “The Solaris 2 “kstat” Interface” on page 387, or you could load the SE toolkit and spend a few seconds modifying a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me, and we’ll add it to the next version of SE.

If you are curious about how some of these numbers behave but can’t be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as infotool.se. A sample snapshot is shown in Figure 16-12 on page 480.

Use of mpstat to Monitor Interrupts and Mutexes

The mpstat output shown below was measured on a 16-CPU SPARCcenter 2000 running Solaris 2.4 and with about 500 active time-shared database users. One of the key measures is smtx, the number of times the CPU failed to obtain a mutex immediately. If the number is more than about 200 per CPU, then usually system time begins to climb. The exception is the master CPU that is taking the 100 Hz clock interrupt, normally CPU 0 but in this case CPU 4. It has a larger number (1,000 or more) of very short mutex stalls that don’t hurt performance. Higher-performance CPUs can cope with higher smtx values before they start to have a problem; more recent releases of Solaris 2 are tuned to remove sources of mutex contention. The best way to solve high levels of mutex contention is to upgrade to at least Solaris 2.6, then use the lockstat command to find which mutexes are suffering from contention, as described in “Monitoring Solaris 2.6 Lock Statistics” on page 239.

Figure 10-8. Monitoring All the CPUs with mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
0 45 1 0 232 0 780 234 106 201 0 950 72 28 0 0
1 29 1 0 243 0 810 243 115 186 0 1045 69 31 0 0
2 27 1 0 235 0 827 243 110 199 0 1000 75 25 0 0
3 26 0 0 217 0 794 227 120 189 0 925 70 30 0 0
4 9 0 0 234 92 403 94 84 1157 0 625 66 34 0 0
5 30 1 0 510 304 764 213 119 176 0 977 69 31 0 0
6 35 1 0 296 75 786 224 114 184 0 1030 68 32 0 0
7 29 1 0 300 96 754 213 116 190 0 982 69 31 0 0
9 41 0 0 356 126 905 231 109 226 0 1078 69 31 0 0
11 26 0 0 231 2 805 235 120 199 0 1047 71 29 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
12 29 0 0 405 164 793 238 117 183 0 879 66 34 0 0
15 31 0 0 288 71 784 223 121 200 0 1049 66 34 0 0
16 71 0 0 263 55 746 213 115 196 0 983 69 31 0 0
17 34 0 0 206 0 743 212 115 194 0 969 69 31 0 0


Cache Affinity Algorithms

When a system that has multiple caches is in use, a process may run on a CPU and load part of itself into that cache, then stop running for a while. When it resumes, the Unix scheduler must decide which CPU to run it on. To reduce cache traffic, the process must preferentially be allocated to its original CPU, but that CPU may be in use or the cache may have been cleaned out by another process in the meantime.

The cost of migrating a process to a new CPU depends on the time since it last ran, the size of the cache, and the speed of the central bus. To strike a delicate balance, a general-purpose algorithm that adapts to all kinds of workloads was developed. In the case of the E10000, this algorithm manages up to 256 Mbytes of cache and has a significant effect on performance. The algorithm works by moving jobs to a private run queue for each CPU. The job stays on a CPU’s run queue unless another CPU becomes idle, looks at all the queues, finds a job that hasn’t run for a while on another queue, and migrates the job to its own run queue. A cache will be overwritten very quickly, so there is little benefit from binding a process very tightly to a CPU.

There is also a surprising amount of low-level background activity from various daemons and dormant applications. Although this activity does not add up to a significant amount of CPU time used, it is enough to cause a large, CPU-bound job to stall long enough for it to migrate to a different CPU. The time-share scheduler ensures that CPU-bound jobs get a long time slice but a low priority. Daemons get a short time slice and a high priority, because they wake up only briefly. The average time slice for a process can be calculated from the number of context switches and the CPU time used (available from the PRUSAGE call described in “Process Data Sources” on page 416). The SE Toolkit calculates and displays this data with an example script; see “pea.se” on page 488.
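
A minimal sketch of that calculation, assuming a prusage_t structure from <sys/procfs.h> has already been filled in for the process:

#include <sys/procfs.h>

/* average time slice in seconds = CPU time used / context switches */
double avg_timeslice(prusage_t *pru)
{
        double cpu_secs = pru->pr_utime.tv_sec + pru->pr_utime.tv_nsec / 1e9
                        + pru->pr_stime.tv_sec + pru->pr_stime.tv_nsec / 1e9;
        unsigned long switches = pru->pr_vctx + pru->pr_ictx;  /* vol + invol */

        return switches ? cpu_secs / switches : cpu_secs;
}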

Unix on Shared Memory Multiprocessors

The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The Spin Lock or Mutex

One key capability in shared memory multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF, which means that another processor got there first, or if it was 0x00, which means that this processor is in charge. This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory.

If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic, so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU. The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.
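
To make this concrete, here is a minimal sketch of such a spin lock in SPARC assembler; it illustrates the technique just described and is not the actual kernel mutex code:

spin_lock:                      /* address of the lock byte is in %o0 */
        ldstub  [%o0],%o1       /* atomically read lock byte, store 0xFF */
        tst     %o1             /* 0x00 means the lock was free and is now ours */
        bne,a   1f              /* lock was held: go and spin in the cache */
        ldub    [%o0],%o1       /* normal load in delay slot, no bus traffic */
        retl                    /* got the lock: return to caller */
        nop
1:
        tst     %o1             /* wait for the holder to store 0x00 */
        bne,a   1b
        ldub    [%o0],%o1       /* reload the lock byte while spinning */
        ba      spin_lock       /* looks free: retry the atomic LDSTUB */
        nop

spin_unlock:
        retl
        stb     %g0,[%o0]       /* store 0x00 in the delay slot to release */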

Code Locking

The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks.

In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one of the CPUs can be executing kernel code at a time.

When code locking is used, there are a fixed number of locks in the system; this number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks, so that adding extra processors will not help performance and may actually reduce performance.

Data Locking and Solaris 2

The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel—one reason why Solaris 2 took several years to create. The result is that the kernel has a lot of concurrency available and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt.

With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks; multiplied by the number of instances of the data, there will typically be many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and “large” locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.

Monitoring Solaris 2.6 Lock Statistics

In Solaris 2.6, the lockstat(1) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its locks to start collecting information and return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disruption of the work on production machines. Data is presented clearly with several options for how it is summarized, but you still must have a very good understanding of Unix kernel architecture and Solaris 2, in particular, to really understand what is happening.

You should read the lockstat(1) manual page to see all its options, but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second duplicating and closing file descriptors. When two shells are run at the same time on a dual-CPU system, mpstat shows that hundreds of mutex spins per second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Figure 10-9. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

% while 1 
end

Figure 10-9. System Call Mix for C Shell Loop
% truss -c -p 351
^C
syscall           seconds    calls  errors
open                  .26     2868
close                 .43    17209
getpid                .44    17208
dup                   .31    14341
sigprocmask           .65    28690
                     ----    -----  ------
sys totals:          2.09    80316       0
usr time:             .39
elapsed:             6.51

The intention was to see what system calls were causing the contention, and it appears to be a combination of dup and close, possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention.

The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and which matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace.

If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem, and you should make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2.

Figure 10-10. Example lockstat Output
# lockstat sleep 5 


Adaptive mutex spin: 3318 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
601 18% 18% 1.00 1 flock_lock cleanlocks+0x10
302 9% 27% 1.00 7 0xf597aab0 dev_get_dev_info+0x4c
251 8% 35% 1.00 1 0xf597aab0 mod_rele_dev_by_major+0x2c
245 7% 42% 1.00 3 0xf597aab0 cdev_size+0x74
160 5% 47% 1.00 7 0xf5b3c738 ddi_prop_search_common+0x50
157 5% 52% 1.00 1 0xf597aab0 ddi_hold_installed_driver+0x2c
141 4% 56% 1.00 2 0xf5c138e8 dnlc_lookup+0x9c
129 4% 60% 1.00 6 0xf5b1a790 ufs_readlink+0x120
128 4% 64% 1.00 1 0xf5c46910 cleanlocks+0x50
118 4% 67% 1.00 7 0xf5b1a790 ufs_readlink+0xbc
117 4% 71% 1.00 6 stable_lock specvp+0x8c
111 3% 74% 1.00 1 0xf5c732a8 spec_open+0x54
107 3% 77% 1.00 3 stable_lock stillreferenced+0x8
97 3% 80% 1.00 1 0xf5b1af00 vn_rele+0x24
92 3% 83% 1.00 19 0xf5c732a8 spec_close+0x104
84 3% 86% 1.00 1 0xf5b0711c sfind+0xc8
57 2% 87% 1.00 1 0xf5c99338 vn_rele+0x24
53 2% 89% 1.00 54 0xf5b5b2f8 sigprocmask+0xa0
46 1% 90% 1.00 1 0xf5b1af00 dnlc_lookup+0x12c
38 1% 91% 1.00 1 0xf5b1af00 lookuppn+0xbc
30 1% 92% 1.00 2 0xf5c18478 dnlc_lookup+0x9c
28 1% 93% 1.00 33 0xf5b5b4a0 sigprocmask+0xa0
24 1% 94% 1.00 4 0xf5c120b8 dnlc_lookup+0x9c
20 1% 95% 1.00 1 0xf5915c58 vn_rele+0x24
19 1% 95% 1.00 1 0xf5af1e58 ufs_lockfs_begin+0x38
16 0% 96% 1.00 1 0xf5915c58 dnlc_lookup+0x12c
16 0% 96% 1.00 1 0xf5c732a8 spec_close+0x7c
15 0% 97% 1.00 1 0xf5915b08 vn_rele+0x24
15 0% 97% 1.00 17 kmem_async_lock kmem_async_dispatch+0xc
15 0% 97% 1.00 1 0xf5af1e58 ufs_lockfs_end+0x24
12 0% 98% 1.00 1 0xf5c99338 dnlc_lookup+0x12c
11 0% 98% 1.00 1 0xf5c8ab10 vn_rele+0x24
9 0% 98% 1.00 1 0xf5b0711c vn_rele+0x24
8 0% 99% 1.00 3 0xf5c0c268 dnlc_lookup+0x9c
8 0% 99% 1.00 5 kmem_async_lock kmem_async_thread+0x290
7 0% 99% 1.00 2 0xf5c11128 dnlc_lookup+0x9c
7 0% 99% 1.00 1 0xf5b1a720 vn_rele+0x24
5 0% 99% 1.00 2 0xf5b5b2f8 clock+0x308
5 0% 100% 1.00 1 0xf5b1a720 dnlc_lookup+0x12c
3 0% 100% 1.00 1 0xf5915b08 dnlc_lookup+0x12c
2 0% 100% 1.00 1 0xf5b5b4a0 clock+0x308
2 0% 100% 1.00 1 0xf5c149b8 dnlc_lookup+0x9c
2 0% 100% 1.00 1 0xf5c8ab10 dnlc_lookup+0x12c
1 0% 100% 1.00 1 0xf5b183d0 esp_poll_loop+0xcc
1 0% 100% 1.00 1 plocks+0x110 polldel+0x28
1 0% 100% 1.00 2 kmem_async_lock kmem_async_thread+0x1ec
1 0% 100% 1.00 6 mml_table+0x18 srmmu_mlist_enter+0x18
1 0% 100% 1.00 3 mml_table+0x10 srmmu_mlist_enter+0x18
--------------------------------------------------------------------------


Adaptive mutex block: 24 events

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
3 12% 12% 1.00 90333 0xf5b3c738 ddi_prop_search_common+0x50
3 12% 25% 1.00 235500 flock_lock cleanlocks+0x10
2 8% 33% 1.00 112250 0xf5c138e8 dnlc_lookup+0x9c
2 8% 42% 1.00 281250 stable_lock specvp+0x8c
2 8% 50% 1.00 107000 0xf597aab0 ddi_hold_installed_driver+0x2c
2 8% 58% 1.00 89750 0xf5b5b2f8 clock+0x308
1 4% 63% 1.00 92000 0xf5c120b8 dnlc_lookup+0x9c
1 4% 67% 1.00 174000 0xf5b0711c sfind+0xc8
1 4% 71% 1.00 238500 stable_lock stillreferenced+0x8
1 4% 75% 1.00 143500 0xf5b17ab8 esp_poll_loop+0x8c
1 4% 79% 1.00 44500 0xf5b183d0 esp_poll_loop+0xcc
1 4% 83% 1.00 81000 0xf5c18478 dnlc_lookup+0x9c
1 4% 88% 1.00 413000 0xf5c0c268 dnlc_lookup+0x9c
1 4% 92% 1.00 1167000 0xf5c46910 cleanlocks+0x50
1 4% 96% 1.00 98500 0xf5c732a8 spec_close+0x7c
1 4% 100% 1.00 94000 0xf5b5b4a0 clock+0x308
--------------------------------------------------------------------------


Spin lock spin: 254 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
48 19% 19% 1.00 3 atomic_lock+0x29e rw_exit+0x34
45 18% 37% 1.00 25 atomic_lock+0x24e rw_exit+0x34
44 17% 54% 1.00 1 atomic_lock+0x29d rw_exit+0x34
35 14% 68% 1.00 1 atomic_lock+0x24e rw_enter+0x34
32 13% 80% 1.00 19 cpus+0x44 disp+0x78
22 9% 89% 1.00 1 atomic_lock+0x29d rw_enter+0x34
17 7% 96% 1.00 52 cpu[2]+0x44 disp+0x78
6 2% 98% 1.00 10 cpu[2]+0x44 setbackdq+0x15c
5 2% 100% 1.00 1 atomic_lock+0x29e rw_enter+0x34
--------------------------------------------------------------------------


Thread lock spin: 3 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
3 100% 100% 1.00 19 0xf68ecf5b ts_tick+0x8
-------------------------------------------------------------------------------


Multiprocessor Hardware Configurations

At any point in time, there exist CPU designs that represent the best performance that can be obtained with current technology at a reasonable price. The cost and technical difficulty of pushing the technology further means that the most cost-effective way of increasing computer power is to use several processors. There have been many attempts to harness multiple CPUs and, today, there are many different machines on the market. Software has been the problem for these machines. It is hard to design software that will be portable across a large range of machines and that will scale up well when large numbers of CPUs are configured. Over many years, Sun has shipped high unit volumes of large-scale multiprocessor machines and has built up a large portfolio of well-tuned applications that are optimized for the Sun architecture.

The most common applications that can use multiprocessors are time-shared systems and database servers. These have been joined by multithreaded versions of CPU-intensive programs like MCAD finite element solvers, EDA circuit simulation packages, and other High Performance Computing (HPC) applications. For graphics-intensive applications where the X server uses a lot of the CPU power on a desktop machine, configuring a dual-CPU system will help—the X protocol allows buffering and batched commands, with few commands needing acknowledgments. In most cases, the application sends X commands to the X server and continues to run without waiting for them to complete. One CPU runs the application, and the other runs the X server and drives the display hardware.

Two classes of multiprocessor machines have some possibility of software compatibility, and both have SPARC-based implementations. For illustrations of these two classes, see Figure 10-11 and Figure 10-12.

Figure 10-11. Typical Distributed Memory Multiprocessor with Mesh Network


Figure 10-12. Typical Small-Scale, Shared Memory Multiprocessor


Distributed Memory Multiprocessors

Distributed memory multiprocessors, also known as Massively Parallel Processors (MPP), can be thought of as a network of uniprocessors. Each processor has its own memory, and data must be explicitly copied over the network to another processor before it can be used. The benefit of this approach is that there is no contention for memory bandwidth to limit the number of processors. Moreover, if the network is made up of point-to-point links, then the network throughput increases as the number of processors increases. There is no theoretical limit to the number of processors that can be used in a system of this type, but there are problems with the high latency of the point-to-point links, and it is hard to find algorithms that scale with large numbers of processors.

Shared Memory Multiprocessors

A shared memory multiprocessor is much more tightly integrated than a distributed memory multiprocessor and consists of a fairly conventional starting point of CPU, memory, and I/O subsystem, with extra CPUs added onto the central bus.

This configuration multiplies the load on the memory system by the number of processors, and the shared bus becomes a bottleneck. To reduce the load, caches are always used and very fast memory systems and buses are built. If more and faster processors are added to the design, the cache size must increase, the memory system must be improved, and the bus throughput must be increased. Most small-scale MP machines support up to four processors. Larger ones support a few tens of processors. With some workloads, the bus or memory system will saturate before the maximum number of processors has been configured.

Special circuitry snoops activity on the bus at all times so that all the caches can be kept coherent. If the current copy of some data is in more than one of the caches, then it will be marked as being shared. If it is updated in one cache, then the copies in the other caches are either invalidated or updated automatically. From a software point of view, there is never any need to explicitly copy data from one processor to another, and shared memory locations are used to communicate values among CPUs. The cache-to-cache transfers still occur when two CPUs use data in the same cache line, so such transfers must be considered from a performance point of view, but the software does not have to worry about it.

There are many examples of shared memory mainframes and minicomputers. SPARC-based Unix multiprocessors range from the dual-processor Ultra 2 and the 4-processor E450, to the 30-processor E6000 and the 64-processor E10000.

The high-end machines normally have multiple I/O subsystems and multiple memory subsystems connected to a much wider central bus, allowing more CPUs to be configured without causing bottlenecks in the I/O and memory systems. The E10000 goes one step further by using a crossbar switch as its interconnect. Multiple transfers occur between CPU, memory, and I/O on separate point-to-point data paths that all transfer data concurrently. This mechanism provides higher throughput, low contention, and low latency.

An unavoidable fact of life is that the memory latency increases as more CPUs are handled by a system design. Even with the same CPU module configuration, an Ultra 2 or E450 has lower main memory latency than does an E6000, which in turn has lower latency than an E10000. The performance of a single uniprocessor job is therefore higher on a smaller machine. The latency of a system increases at high load levels, and this load-dependent increase is more of an issue on a smaller machine. When comparing the E6000 and E10000, it seems that the E6000 has an advantage up to 20 CPUs, and the E10000 is usually faster for loads that consume 24–28 CPUs. The E10000 is the only option from 30–64 CPUs, and it maintains a lower latency under heavy load than any other system. So, from purely a performance point of view, it is far faster to have a row of four-CPU E450s than to use an E10000 that is divided into many separate four-CPU domains.

Clusters of Shared Memory Machines

In any situation, the most efficient configuration is to use the smallest number of the fastest CPUs communicating over the lowest latency interconnect. This configuration minimizes contention and maximizes scalability. It is often necessary to compromise for practical or cost reasons, but it is clear that the best approach to building a large-scale system is to use the largest shared memory systems you can get, then cluster them together with the smallest number of nodes. The performance of a cluster is often limited by activities that are too hard to distribute, and so performance degrades to the performance of a single node for some of the time. If that node is as large as possible, then the overall efficiency is much higher.

Shared/Distributed Memory Multiprocessors

Although the two types of MP machines started out as very different approaches, the most recent examples of these types have converged.

The shared memory machines all have large caches for each CPU. For good performance, it is essential for the working set of data and instructions to be present in the cache. The cache is now analogous to the private memory of a distributed memory machine, and many algorithms that have been developed to partition data for a distributed memory machine are applicable to a cache-based, shared memory machine.

The latest distributed memory machines have moved from a software- or DMA-driven point-to-point link to an MMU-driven, cache-line-based implementation of distributed shared memory. With this system, the hardware is instructed to share a region of address space with another processor, and cache coherence is maintained over a point-to-point link. As the speed of these links approaches the speed of a shared memory backplane, the distributed memory system can be used to run shared memory algorithms more efficiently. Link speeds are comparable to the low end of shared memory backplane speeds, although they typically have much higher latency. The extra latency is due to the physical distances involved and the use of complex switching systems to interconnect large numbers of processors. Large distributed memory systems have higher latency than do smaller ones.

NUMA Multiprocessors

The hybrid that has emerged is known as nonuniform memory access, or NUMA. NUMAs are shared memory multiprocessor machines where the memory latency varies. If the access is to memory in the local node, then access is fast; if the access request has to travel over one or more communication links, then access becomes progressively slower. The node itself is often a single-board SMP with two to four CPUs. The SGI Origin 2000 system uses a twin-CPU node; the various Intel Pentium-based systems from Sequent, Data General, and others use a 4-CPU node. However, it makes far more sense when building NUMA systems to use an entire high-end server as a node, with 20 or more CPUs. The bulk of high-end server system requirements are for 10 to 30 or so CPUs, and customers like to buy a half-configured system to allow for future expansion. For Sun this requirement can be satisfied by an SMP such as the E6000 or E10000, so Sun had no need to develop a NUMA-based system for the 1996–1998 generation. NUMA-based competitors must configure multiple nodes to satisfy today’s requirements, and users of these systems are suffering from the new problems that come with this technology.

Benchmarks such as the TPC-D data warehouse test—see http://www.tpc.org—have shown that Sun’s SMP machines do beat the NUMA machines on price and performance over the whole range of database sizes. This benchmark is ideally suited to NUMA because the query structures are well understood and the data layout can be optimized to keep the NUMA nodes well balanced. In real life, data warehouse queries are often constructed ad hoc, and it is not possible to optimize data placement in advance. The TPC-D benchmark is substantially overstating the performance of NUMA machines as data warehouse engines because of this effect. Data placement is not an issue in one large SMP node. In general, the workload on a system is not constant; it can vary from one query to the next or from one day to the next. In a benchmark situation, a NUMA machine can be optimized for the fixed benchmark workload but would need to be continuously reoptimized for real-world workloads.

A technique used to speed up NUMA systems is data replication. The same data can be replicated in every node so that it is always available at a short memory latency. This technique has been overused by some vendors in benchmarks, with hundreds of Mbytes of data duplicated over all the nodes. The effect is that a large amount of RAM is wasted because there are many nodes over which to duplicate the data. In a 32-CPU system made up of 2-way or 4-way nodes, you might need to configure extra memory to hold 16 or 8 copies of the data. On a 32-CPU SMP system, there is a single copy. If there are any writes to this data, the NUMA machines must synchronize all their writes, a significant extra load.

For all these reasons, the best MPP or NUMA configuration should be built by clustering the smallest number of the largest possible SMP nodes.

CPU Caches

Performance can vary even among machines that have the same CPU and clock rate if they have different memory systems. The interaction between caches and algorithms can cause performance problems, and an algorithm change may be called for to work around the problem. This section provides information on the CPU caches and memory management unit designs of various SPARC platforms so that application developers can understand the implications of differing memory systems[1]. This section may also be useful to compiler writers.

[1] High Performance Computing by Kevin Dowd covers this subject very well.

CPU Cache History

Historically, the memory systems on Sun machines provided equal access times for all memory locations. This was true on the Sun-2™ machines and was true for data accesses on the Sun-3/50™ and Sun-3/60 machines.

Equal access was achieved by running the entire memory system as fast as the processor could access it. On Motorola 68010 and 68020 processors, four clock cycles were required for each memory access. On a 10 MHz Sun-2, the memory ran at a 400 ns cycle time; on a 20 MHz Sun-3/60, it ran at a 200 ns cycle time, so the CPU never had to wait for memory to be ready.

SPARC processors run at higher clock rates and want to access memory in a single cycle. Main memory cycle times have improved a little, and very wide memory systems can transfer a 32- or 64-byte block almost as quickly as a single word. The CPU cache is designed to cope with the mismatch between the way DRAM can handle blocks of memory efficiently and the way that the CPU wants to access one word at a time at much higher speed.

The CPU cache is an area of fast memory made from static RAM, or SRAM. The cache transfers data from main DRAM memory using page mode in a block of 16, 32, or 64 bytes at a time. Hardware keeps track of the data in the cache and copies data into the cache when required.

More advanced caches have multiple levels and can be split so that instructions and data use separate caches. The UltraSPARC CPU has a 16-Kbyte instruction cache and separate 16-Kbyte data cache, which are loaded from a second-level 512-Kbyte to 4-Mbyte combined cache that loads from the memory system. We first examine simple caches before we look at the implications of more advanced configurations.

Cache Line and Size Effects

A SPARCstation 2 is a very simple design with clearly predictable behavior. More recent CPU designs are extremely complicated, so I’ll start with a simple example. The SPARCstation 2 has its 64 Kbytes of cache organized as 2048 blocks of 32 bytes each. When the CPU accesses an address, the cache controller checks to see if the right data is in the cache; if it is, then the CPU loads it without stalling. If the data needs to be fetched from main memory, then the CPU clock is effectively stopped for 24 or 25 cycles while the cache block is loaded. The implications for performance tuning are obvious. If your application is accessing memory very widely and its accesses are missing the cache rather than hitting the cache, then the CPU may spend a lot of its time stopped. By changing the application, you may be able to improve the hit rate and achieve a worthwhile performance gain. The 25-cycle delay is known as the miss cost, and the effective performance of an application is reduced by a large miss cost and a low hit rate. The effect of context switches is to further reduce the cache hit rate because after a context switch the contents of the cache will need to be replaced with instructions and data for the new process.

Applications access memory on almost every cycle. Most instructions take a single cycle to execute, and the instructions must be read from memory; data accesses typically occur on 20–30 percent of the cycles. The effect of changes in hit rate for a 25-cycle miss cost is shown in Table 10-1 and Figure 10-13 in both tabular and graphical forms. A 25-cycle miss cost implies that a hit takes 1 cycle and a miss takes 26 cycles.

Figure 10-13. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost


Table 10-1. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost
Hit Rate    Hit Time    Miss Time    Total Time    Performance
100%        100%        0%           100%          100%
99%         99%         26%          125%          80%
98%         98%         52%          150%          66%
96%         96%         104%         200%          50%
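
The arithmetic behind the table is simple; as a sketch in C, using the hit and miss costs stated above:

/* total time relative to a 100% hit rate: 1-cycle hit, 26-cycle miss */
double relative_time(double hit_rate)      /* hit_rate between 0.0 and 1.0 */
{
        return hit_rate * 1.0 + (1.0 - hit_rate) * 26.0;  /* 0.96 -> 2.0 = 200% */
}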

The execution time increases dramatically as the hit rate drops. Although a 96 percent hit rate sounds quite high, you can see that the program will be running at half speed. Many small benchmarks like Dhrystone run at a 100 percent hit rate; real applications such as databases are moving data around and following linked lists and so have a poor hit rate. It isn’t that difficult to do things that are bad for the cache, so it is a common cause of reduced performance.

A complex processor such as UltraSPARC I does not have fixed cache miss costs; the cost depends upon the exact instruction sequencing, and in many cases the effective miss cost is reduced by complex hardware pipelining and compiler optimizations. The on-chip, 16-Kbyte data cache takes 1 cycle, the external cache between 1 and 7 clock cycles (it can supply a word of data in every cycle but may stall the CPU for up to 7 cycles if the data is used immediately), and main memory between 40 and 50 cycles, depending upon the CPU and system clock rates. With the cache hit rate varying in a two-level cache, we can’t draw a simple performance curve, but relatively, performance is better than the curve shown in Figure 10-13 when the system is running inside the large second-level cache, and worse when running from main memory.

A Problem with Linked Lists

Applications that traverse large linked lists cause problems with caches. Let’s look at this in detail on a simple SPARCstation 2 architecture and assume that the list has 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block, as shown in Figure 10-14, then we can deduce the effect on the memory system of chaining down the list.

Figure 10-14. Linked List Example


The code to perform the search is a tight loop, shown in Figure 10-15. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data when only 4 bytes were wanted. Also, with a 64-Kbyte cache, only 2,048 cache lines are available, so after 1,024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.

Figure 10-15. Linked List Search Code in C
/* C code to search a linked list for a value and miss the cache a lot */ 

struct block {
        struct block *link;             /* link to next block */
        int pad1[11];
        int data;                       /* data item to check for */
        int pad2[12];
} blocks[5000];

struct block *find(pb, value)
        struct block *pb;
        int value;
{
        while (pb)                      /* check for end of linked list */
        {
                if (pb->data == value)  /* check for value match */
                        return pb;      /* return matching block */
                pb = pb->link;          /* follow link to next block */
        }
        return (struct block *)0;       /* return null if no match */
}

The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of computer systems. In fact, the problem gets worse as the processor gets faster since the miss cost tends to increase because of the difference in speed between the CPU and cache clock rate and the main memory speed. With more recent processors, it may be possible to cache a larger linked list in external cache, but the on-chip cache is smaller than 64 Kbytes. What may happen is that the first time through the list, each cache miss costs 50 cycles, and on subsequent passes it costs 8 cycles.

The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op, as shown in Figure 10-16. Note that the instruction after a branch is always executed on a SPARC processor. This loop executes in 9 cycles (225 ns) on a SPARCstation 2 if it hits the cache, and in 59 cycles (1,475 ns) if both loads miss. On a 200 MHz UltraSPARC, the loop would run in approximately 5 cycles (25 ns) if it hit the on-chip cache, 20 cycles (100 ns) in the off-chip cache, and 100 cycles (500 ns) from main memory the first time through.

Figure 10-16. Linked List Search Loop in Assembler
LY1:                            /* loop setup code omitted */
        cmp     %o2,%o1         /* see if data == value */
        be      L77016          /* and exit loop if matched */
        nop                     /* pad branch delay slot */
        ld      [%o0],%o0       /* follow link to next block */
        tst     %o0             /* check for end of linked list */
        bne,a   LY1             /* branch back to start of loop */
        ld      [%o0+48],%o2    /* load data in branch delay slot */

The memory latency test in Larry McVoy’s lmbench set of benchmarks is based on this situation. It gives a worst-case situation of unrelated back-to-back loads but is very common in real application code. See also “The Cache-Aligned Block Copy Problem” on page 265 for a description of another cache problem.

For detailed information on specific cache architectures, see “SPARC CPU Cache Architectures” on page 258. Chapter 12, “Caches,” on page 299 provides a more detailed explanation of the way in which caches in general work.