[Cockcroft98] Chapter 10. Processors


Previous chapters have looked at the disk and network components of a system. This chapter covers the processor component. The details of particular processors are dealt with later. This chapter explains the measurements that each version of Solaris makes on a per-processor basis. It also discusses the trade-offs made by different multiprocessor architectures.

Monitoring Processors

This section lists the most important commands that you should know and explains where their data comes from.

CPU Control Commands — psrinfo, psradm, and psrset

psrinfo tells you which CPUs are in use and when they were last enabled or disabled. psradm actually controls the CPUs. Note that in server systems, interrupts are bound to specific CPUs to improve the cache hit rate, and clock interrupts always go to CPU 0. Even if a CPU is disabled, it still receives interrupts. The psrset command manages processor sets in Solaris 2.6.
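
For example, the following sequence lists the CPUs, takes CPU 1 offline and back online with psradm -f and psradm -n, then creates a processor set containing CPU 1 with psrset -c. The CPU numbers and timestamps are illustrative, psradm and psrset must be run as root, and psrset requires Solaris 2.6:

% psrinfo
0       on-line   since 05/12/98 11:12:28
1       on-line   since 05/12/98 11:12:30
# psradm -f 1
# psradm -n 1
# psrset -c 1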

The Load Average

The load average is the sum of the run queue length and the number of jobs currently running on CPUs. The three figures given are averages over the last 1, 5, and 15 minutes. They can be used as an average run queue length indicator to see if more CPU power is required.

% uptime 
11:40pm up 7 day(s), 3:54, 1 user, load average: 0.27, 0.22, 0.24

The Data Behind vmstat and mpstat Output

What do all those columns of data in vmstat mean? How do they relate to the data from mpstat, and where does it all come from?

First, let’s look at Figure 10-1 to remind ourselves what vmstat itself looks like.

Figure 10-1. Example vmstat Output
% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98
0 0 0 64724 25184 0 24 56 0 0 0 0 0 0 19 0 311 1112 356 2 4 94
0 0 0 64724 24796 0 5 38 0 0 0 0 0 0 15 0 92 325 212 0 1 99
0 0 0 64680 24584 0 12 106 0 0 0 0 0 0 41 0 574 1094 340 2 5 93
0 1 0 64632 23736 0 0 195 0 0 0 0 0 0 66 0 612 870 270 1 7 92
0 0 0 64628 22796 0 0 144 0 0 0 0 0 0 59 0 398 764 222 1 8 91
0 0 0 64620 22796 0 0 79 0 0 0 0 0 0 50 0 255 1383 136 2 18 80


The command prints the first line of data immediately, then every five seconds prints a new line that gives the average rates over the five-second interval. The first line is also the average rate over the interval that started when the system was booted, because the numbers are stored by the system as counts of the number of times each event has happened. To average over a time interval, you measure the counters at the start and end and divide the difference by the time interval. For the very first measure, there are zero counters to subtract, so you automatically get the count since boot, divided by the time since boot. The absolute counters themselves can be seen with another option to vmstat, as shown in Figure 10-2.
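
As a minimal sketch of that arithmetic in C (the function and variable names are illustrative, not taken from the vmstat source):

/* turn two samples of a cumulative event counter into an average rate */
double rate_per_sec(unsigned long now, unsigned long before, double interval)
{
        return (now - before) / interval;   /* e.g., system calls per second */
}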

Figure 10-2. Example vmstat -s Raw Counters Output
% vmstat -s 
0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
208724 total address trans. faults taken
45821 page ins
3385 page outs
61800 pages paged in
27132 pages paged out
712 total reclaims
712 reclaims from free list
0 micro (hat) faults
208724 minor (as) faults
44636 major faults
34020 copy-on-write faults
77883 zero fill page faults
9098 pages examined by the clock daemon
1 revolutions of the clock hand
27748 pages freed by the clock daemon
1333 forks
187 vforks
1589 execs
6730851 cpu context switches
12848989 device interrupts
340014 traps
28393796 system calls
285638 total name lookups (cache hits 91%)
108 toolong
159288 user cpu
123409 system cpu
15185004 idle cpu
192794 wait cpu


The other closely related command is mpstat, which shows basically the same data but on a per-CPU basis. Figure 10-3 shows some output on a dual-CPU system.

Figure 10-3. Example mpstat Output
% mpstat 5 
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1 0 4 82 17 43 0 5 1 0 182 1 1 1 97
2 1 0 3 81 17 42 0 5 2 0 181 1 1 1 97
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 2 0 39 156 106 42 0 5 21 0 30 0 2 61 37
2 0 0 0 158 106 103 5 4 8 0 1704 3 36 61 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 19 28 194 142 96 1 4 18 0 342 1 8 76 16
2 0 6 11 193 141 62 4 4 10 0 683 5 15 74 6
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 22 33 215 163 87 0 7 0 0 287 1 4 90 5
2 0 22 29 214 164 88 2 8 1 0 304 2 5 89 4


The Sources of vmstat Data

The data is read from the kernel statistics interface that is described in “The Solaris 2 “kstat” Interface” on page 387. Most of the data is maintained on a per-CPU basis by the kernel and is combined into the overall summaries by the commands themselves. The same kstat(3K) programming interface that is used to get at the per-disk statistics is used. It is very lightweight—it takes only a few microseconds to retrieve each data structure. The data structures are based on those described in the file /usr/include/sys/sysinfo.h, the system information header file. Of course, all the raw kstat data is directly available to the SE toolkit, which contains customized scripts; see “vmstat.se” on page 500 and “mpvmstat.se” on page 485.
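
As a minimal sketch of how a tool like vmstat might read the global sysinfo counters through this interface (error handling omitted; link with -lkstat; the module and name strings follow the usual unix:0:sysinfo naming, which you can confirm on your own system):

#include <stdio.h>
#include <kstat.h>
#include <sys/sysinfo.h>

int main(void)
{
        kstat_ctl_t *kc = kstat_open();     /* open the kstat interface */
        kstat_t *ksp = kstat_lookup(kc, "unix", 0, "sysinfo");
        sysinfo_t si;

        kstat_read(kc, ksp, &si);           /* copy out the raw counters */
        /* average run queue length since boot = runque / updates */
        printf("updates %lu runque %lu waiting %lu\n",
            si.updates, si.runque, si.waiting);
        kstat_close(kc);
        return 0;
}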

Process Queues

As we follow through the fields of vmstat, we see that the first one is labeled procs, r, b, w. This field is derived from the sysinfo data, and there is a single global kstat with the same form as the sysinfo structure shown in Figure 10-4.

Figure 10-4. The sysinfo Process Queue Data Structure
typedef struct sysinfo {        /* (update freq) update action          */
        ulong updates;          /* (1 sec) ++                           */
        ulong runque;           /* (1 sec) += num runnable procs        */
        ulong runocc;           /* (1 sec) ++ if num runnable procs > 0 */
        ulong swpque;           /* (1 sec) += num swapped procs         */
        ulong swpocc;           /* (1 sec) ++ if num swapped procs > 0  */
        ulong waiting;          /* (1 sec) += jobs waiting for I/O      */
} sysinfo_t;

As the comments indicate, the fields are updated once per second. The run queue counts the number of runnable processes that are not running. The extra data values, called runocc and swpocc, are displayed by sar -q and show the occupancy of the queues. Solaris 2 counts the total number of swapped-out idle processes, so if you see any swapped jobs registered in swpque, don’t be alarmed. sar -q is strange: if the number to be displayed is zero, sar -q displays nothing at all, just white space. If you add the number of processes actually running on CPUs in the system, then you get the same measure on which the load average is based. waiting counts how many processes are waiting for a block device transfer (a disk I/O) to complete. It shows up as the b field in vmstat.

Virtual Memory Counters

The next part of vmstat lists the free swap space and memory. This information is obtained as a kstat from a single global vminfo structure, as shown in Figure 10-5.

Figure 10-5. The vminfo Memory Usage Data Structure
typedef struct vminfo {         /* (update freq) update action          */
        longlong_t freemem;     /* (1 sec) += freemem in pages          */
        longlong_t swap_resv;   /* (1 sec) += reserved swap in pages    */
        longlong_t swap_alloc;  /* (1 sec) += allocated swap in pages   */
        longlong_t swap_avail;  /* (1 sec) += unreserved swap in pages  */
        longlong_t swap_free;   /* (1 sec) += unallocated swap in pages */
} vminfo_t;

The only swap number shown by vmstat is swap_avail, which is the most important one. If this number ever goes to zero, your system will hang up and be unable to start more processes! For some strange reason, sar -r reports swap_free instead and converts the data into useless units of 512-byte blocks. The bizarre state of the sar command is one reason why we were motivated to create and develop the SE toolkit in the first place!

These measures are accumulated, so you calculate an average level by taking the values at two points in time and dividing the difference by the number of updates.
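
A minimal sketch of that calculation, assuming vminfo and sysinfo are sampled at the same two points in time (both sets of counters are updated once per second, so the sysinfo updates counter gives the number of updates):

/* average available swap in pages over a measurement interval */
double avg_swap_avail(vminfo_t *v0, vminfo_t *v1, sysinfo_t *s0, sysinfo_t *s1)
{
        return (double)(v1->swap_avail - v0->swap_avail)
             / (double)(s1->updates - s0->updates);
}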

Paging Counters

The paging counters shown in Figure 10-6 are maintained on a per-CPU basis; it’s also clear that vmstat and sar don’t show all of the available information. The states and state transitions being counted are described in detail in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Figure 10-6. The cpu_vminfo Per-CPU Paging Counters Structure
typedef struct cpu_vminfo {
        ulong pgrec;            /* page reclaims (includes page-out)    */
        ulong pgfrec;           /* page reclaims from free list         */
        ulong pgin;             /* page-ins                             */
        ulong pgpgin;           /* pages paged in                       */
        ulong pgout;            /* page-outs                            */
        ulong pgpgout;          /* pages paged out                      */
        ulong swapin;           /* swap-ins                             */
        ulong pgswapin;         /* pages swapped in                     */
        ulong swapout;          /* swap-outs                            */
        ulong pgswapout;        /* pages swapped out                    */
        ulong zfod;             /* pages zero filled on demand          */
        ulong dfree;            /* pages freed by daemon or auto        */
        ulong scan;             /* pages examined by page-out daemon    */
        ulong rev;              /* revolutions of the page daemon hand  */
        ulong hat_fault;        /* minor page faults via hat_fault()    */
        ulong as_fault;         /* minor page faults via as_fault()     */
        ulong maj_fault;        /* major page faults                    */
        ulong cow_fault;        /* copy-on-write faults                 */
        ulong prot_fault;       /* protection faults                    */
        ulong softlock;         /* faults due to software locking req   */
        ulong kernel_asflt;     /* as_fault()s in kernel addr space     */
        ulong pgrrun;           /* times pager scheduled                */
} cpu_vminfo_t;


A few of these might need some extra explanation. Protection faults occur when a program tries to access memory it shouldn’t, gets a segmentation violation signal, and dumps a core file. Hat faults occur only on systems that have a software-managed memory management unit (sun4c and sun4u). Hat stands for hardware address translation.

I’ll skip the disk counters because vmstat is just reading the same kstat data as iostat and providing a crude count of the number of operations for the first four disks.

CPU Usage and Event Counters

Let’s remind ourselves again what vmstat looks like.

% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s2 s3 in sy cs us sy id
0 0 0 72724 25348 0 2 3 1 1 0 0 0 0 1 0 63 362 85 1 1 98


The last six columns show the interrupt rate, system call rate, context switch rate, and CPU user, system, and idle time. The per-CPU structure from which these are derived is the biggest structure yet, with about 60 values. Some of them are summarized by sar, but a lot of interesting stuff here is being carefully recorded by the kernel, is read by vmstat, and is then just thrown away. Look down the comments in Figure 10-7, and at the end I’ll point out some nonobvious and interesting values. The cpu and wait states are arrays holding the four CPU states—usr/sys/idle/wait—and three wait states—io/swap/pio. Only the io wait state is implemented, so this measurement is superfluous. (I made the mpvmstat.se script print out the wait states before I realized that they were always zero.)

Figure 10-7. The cpu_sysinfo Per-CPU System Information Structure
typedef struct cpu_sysinfo {
        ulong cpu[CPU_STATES];  /* CPU utilization                      */
        ulong wait[W_STATES];   /* CPU wait time breakdown              */
        ulong bread;            /* physical block reads                 */
        ulong bwrite;           /* physical block writes (sync+async)   */
        ulong lread;            /* logical block reads                  */
        ulong lwrite;           /* logical block writes                 */
        ulong phread;           /* raw I/O reads                        */
        ulong phwrite;          /* raw I/O writes                       */
        ulong pswitch;          /* context switches                     */
        ulong trap;             /* traps                                */
        ulong intr;             /* device interrupts                    */
        ulong syscall;          /* system calls                         */
        ulong sysread;          /* read() + readv() system calls        */
        ulong syswrite;         /* write() + writev() system calls      */
        ulong sysfork;          /* forks                                */
        ulong sysvfork;         /* vforks                               */
        ulong sysexec;          /* execs                                */
        ulong readch;           /* bytes read by rdwr()                 */
        ulong writech;          /* bytes written by rdwr()              */
        ulong rcvint;           /* XXX: UNUSED                          */
        ulong xmtint;           /* XXX: UNUSED                          */
        ulong mdmint;           /* XXX: UNUSED                          */
        ulong rawch;            /* terminal input characters            */
        ulong canch;            /* chars handled in canonical mode      */
        ulong outch;            /* terminal output characters           */
        ulong msg;              /* msg count (msgrcv()+msgsnd() calls)  */
        ulong sema;             /* semaphore ops count (semop() calls)  */
        ulong namei;            /* pathname lookups                     */
        ulong ufsiget;          /* ufs_iget() calls                     */
        ulong ufsdirblk;        /* directory blocks read                */
        ulong ufsipage;         /* inodes taken with attached pages     */
        ulong ufsinopage;       /* inodes taken with no attached pages  */
        ulong inodeovf;         /* inode table overflows                */
        ulong fileovf;          /* file table overflows                 */
        ulong procovf;          /* proc table overflows                 */
        ulong intrthread;       /* interrupts as threads (below clock)  */
        ulong intrblk;          /* intrs blkd/preempted/released (swtch) */
        ulong idlethread;       /* times idle thread scheduled          */
        ulong inv_swtch;        /* involuntary context switches         */
        ulong nthreads;         /* thread_create()s                     */
        ulong cpumigrate;       /* cpu migrations by threads            */
        ulong xcalls;           /* xcalls to other cpus                 */
        ulong mutex_adenters;   /* failed mutex enters (adaptive)       */
        ulong rw_rdfails;       /* rw reader failures                   */
        ulong rw_wrfails;       /* rw writer failures                   */
        ulong modload;          /* times loadable module loaded         */
        ulong modunload;        /* times loadable module unloaded       */
        ulong bawrite;          /* physical block writes (async)        */
/* Following are gathered only under #ifdef STATISTICS in source */
        ulong rw_enters;        /* tries to acquire rw lock             */
        ulong win_uo_cnt;       /* reg window user overflows            */
        ulong win_uu_cnt;       /* reg window user underflows           */
        ulong win_so_cnt;       /* reg window system overflows          */
        ulong win_su_cnt;       /* reg window system underflows         */
        ulong win_suo_cnt;      /* reg window system user overflows     */
} cpu_sysinfo_t;


Some of the numbers printed by mpstat are visible here. The smtx value used to watch for kernel contention is mutex_adenters. The srw value is the sum of the failures to obtain a readers/writer lock. The term xcalls is shorthand for cross-calls. A cross-call occurs when one CPU wakes up another CPU by interrupting it.
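
As a sketch of how the usr/sys/wt/idl columns can be derived from two samples of this structure (the CPU_USER, CPU_KERNEL, CPU_WAIT, and CPU_IDLE indices into the cpu array are defined in sysinfo.h):

#include <sys/sysinfo.h>

/* convert two samples of the per-CPU state counters into percentages */
void cpu_percentages(cpu_sysinfo_t *s0, cpu_sysinfo_t *s1,
    double pct[CPU_STATES])
{
        ulong delta[CPU_STATES], total = 0;
        int i;

        for (i = 0; i < CPU_STATES; i++) {
                delta[i] = s1->cpu[i] - s0->cpu[i];
                total += delta[i];
        }
        for (i = 0; i < CPU_STATES; i++)    /* CPU_USER, CPU_KERNEL, ... */
                pct[i] = total ? 100.0 * delta[i] / total : 0.0;
}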

vmstat prints out 22 columns of numbers, summarizing over a hundred underlying measures (even more on a multiprocessor). It’s good to have a lot of different things summarized in one place, but the layout of vmstat (and sar) is as much a result of their long history as it is by design. Could you do better?

I’m afraid I’m going to end up plugging the SE toolkit again. It’s just so easy to get at this data and do things with it. All the structures can be read by any user, with no need to be setuid root. (This is a key advantage of Solaris 2; other Unix systems read the kernel directly, so you would have to obtain root permissions.)

If you want to customize your very own vmstat, you could either write one from scratch in C, using the kstat library as described in “The Solaris 2 “kstat” Interface” on page 387, or you could load the SE toolkit and spend a few seconds modifying a trivial script. Either way, if you come up with something that you think is an improvement, while staying with the basic concept of one line of output that fits in 80 columns, send it to me, and we’ll add it to the next version of SE.

If you are curious about how some of these numbers behave but can’t be bothered to write your own SE scripts, you should try out the GUI front end to the raw kstat data that is provided in the SE toolkit as infotool.se. A sample snapshot is shown in Figure 16-12 on page 480.

Use of mpstat to Monitor Interrupts and Mutexes

The mpstat output shown below was measured on a 16-CPU SPARCcenter 2000 running Solaris 2.4 and with about 500 active time-shared database users. One of the key measures is smtx, the number of times the CPU failed to obtain a mutex immediately. If the number is more than about 200 per CPU, then usually system time begins to climb. The exception is the master CPU that is taking the 100 Hz clock interrupt, normally CPU 0 but in this case CPU 4. It has a larger number (1,000 or more) of very short mutex stalls that don’t hurt performance. Higher-performance CPUs can cope with higher smtx values before they start to have a problem; more recent releases of Solaris 2 are tuned to remove sources of mutex contention. The best way to solve high levels of mutex contention is to upgrade to at least Solaris 2.6, then use the lockstat command to find which mutexes are suffering from contention, as described in “Monitoring Solaris 2.6 Lock Statistics” on page 239.

Figure 10-8. Monitoring All the CPUs with mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl 
0 45 1 0 232 0 780 234 106 201 0 950 72 28 0 0
1 29 1 0 243 0 810 243 115 186 0 1045 69 31 0 0
2 27 1 0 235 0 827 243 110 199 0 1000 75 25 0 0
3 26 0 0 217 0 794 227 120 189 0 925 70 30 0 0
4 9 0 0 234 92 403 94 84 1157 0 625 66 34 0 0
5 30 1 0 510 304 764 213 119 176 0 977 69 31 0 0
6 35 1 0 296 75 786 224 114 184 0 1030 68 32 0 0
7 29 1 0 300 96 754 213 116 190 0 982 69 31 0 0
9 41 0 0 356 126 905 231 109 226 0 1078 69 31 0 0
11 26 0 0 231 2 805 235 120 199 0 1047 71 29 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
12 29 0 0 405 164 793 238 117 183 0 879 66 34 0 0
15 31 0 0 288 71 784 223 121 200 0 1049 66 34 0 0
16 71 0 0 263 55 746 213 115 196 0 983 69 31 0 0
17 34 0 0 206 0 743 212 115 194 0 969 69 31 0 0


Cache Affinity Algorithms

When a system that has multiple caches is in use, a process may run on a CPU and load part of itself into that cache, then stop running for a while. When it resumes, the Unix scheduler must decide which CPU to run it on. To reduce cache traffic, the process must preferentially be allocated to its original CPU, but that CPU may be in use or the cache may have been cleaned out by another process in the meantime.

The cost of migrating a process to a new CPU depends on the time since it last ran, the size of the cache, and the speed of the central bus. To strike a delicate balance, a general-purpose algorithm that adapts to all kinds of workloads was developed. In the case of the E10000, this algorithm manages up to 256 Mbytes of cache and has a significant effect on performance. The algorithm works by moving jobs to a private run queue for each CPU. The job stays on a CPU’s run queue unless another CPU becomes idle, looks at all the queues, finds a job that hasn’t run for a while on another queue, and migrates the job to its own run queue. A cache will be overwritten very quickly, so there is little benefit from binding a process very tightly to a CPU.

There is also a surprising amount of low-level background activity from various daemons and dormant applications. Although this activity does not add up to a significant amount of CPU time used, it is enough to cause a large, CPU-bound job to stall long enough for it to migrate to a different CPU. The time-share scheduler ensures that CPU-bound jobs get a long time slice but a low priority. Daemons get a short time slice and a high priority, because they wake up only briefly. The average time slice for a process can be calculated from the number of context switches and the CPU time used (available from the PRUSAGE call described in “Process Data Sources” on page 416). The SE Toolkit calculates and displays this data with an example script; see “pea.se” on page 488.
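
A minimal sketch of that calculation, assuming a prusage_t structure from <sys/procfs.h> has already been filled in for the process:

#include <sys/procfs.h>

/* average time slice in seconds = CPU time used / context switches */
double avg_timeslice(prusage_t *pru)
{
        double cpu_secs = pru->pr_utime.tv_sec + pru->pr_utime.tv_nsec / 1e9
                        + pru->pr_stime.tv_sec + pru->pr_stime.tv_nsec / 1e9;
        unsigned long switches = pru->pr_vctx + pru->pr_ictx;  /* vol + invol */

        return switches ? cpu_secs / switches : cpu_secs;
}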

Unix on Shared Memory Multiprocessors

The Unix kernel has many critical regions, or sections of code, where a data structure is being created or updated. These regions must not be interrupted by a higher-priority interrupt service routine. A uniprocessor Unix kernel manages these regions by setting the interrupt mask to a high value while executing in the region. On a multiprocessor, there are other processors with their own interrupt masks, so a different technique must be used to manage critical regions.

The Spin Lock or Mutex

One key capability in shared memory multiprocessor systems is the ability to perform interprocessor synchronization by means of atomic load/store or swap instructions. All SPARC chips have an instruction called LDSTUB, which means load-store-unsigned-byte. The instruction reads a byte from memory into a register, then writes 0xFF into memory in a single, indivisible operation. The value in the register can then be examined to see if it was already 0xFF, which means that another processor got there first, or if it was 0x00, which means that this processor is in charge. This instruction is used to make mutual exclusion locks (known as mutexes) that make sure only one processor at a time can hold the lock. The lock is acquired through LDSTUB and cleared by storing 0x00 back to memory.

If a processor does not get the lock, then it may decide to spin by sitting in a loop and testing the lock until it becomes available. By checking with a normal load instruction in a loop before issuing an LDSTUB, the processor performs the spin within the cache, and the bus snooping logic watches for the lock being cleared. In this way, spinning causes no bus traffic, so processors that are waiting do not slow down those that are working. A spin lock is appropriate when the wait is expected to be short. If a long wait is expected, the process should sleep for a while so that a different job can be scheduled onto the CPU. The extra SPARC V9 instructions introduced with UltraSPARC include an atomic compare-and-swap operation, which is used by the UltraSPARC-specific version of the kernel to make locking more efficient.
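
To make this concrete, here is a minimal sketch of such a spin lock in SPARC assembler; it illustrates the technique just described and is not the actual kernel mutex code:

spin_lock:                      /* address of the lock byte is in %o0 */
        ldstub  [%o0],%o1       /* atomically read lock byte, store 0xFF */
        tst     %o1             /* 0x00 means the lock was free and is now ours */
        bne,a   1f              /* lock was held: go and spin in the cache */
        ldub    [%o0],%o1       /* normal load in delay slot, no bus traffic */
        retl                    /* got the lock: return to caller */
        nop
1:
        tst     %o1             /* wait for the holder to store 0x00 */
        bne,a   1b
        ldub    [%o0],%o1       /* reload the lock byte while spinning */
        ba      spin_lock       /* looks free: retry the atomic LDSTUB */
        nop

spin_unlock:
        retl
        stb     %g0,[%o0]       /* store 0x00 in the delay slot to release */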

Code Locking

The simplest way to convert a Unix kernel that is using interrupt levels to control critical regions for use with multiprocessors is to replace the call that sets interrupt levels high with a call to acquire a mutex lock. At the point where the interrupt level was lowered, the lock is cleared. In this way, the same regions of code are locked for exclusive access. This method has been used to a greater or lesser extent by most MP Unix implementations, including SunOS 4 on the SPARCserver 600MP machines. The amount of actual concurrency that can take place in the kernel is controlled by the number and position of the locks.

In SunOS 4 there is effectively a single lock around the entire kernel. The reason for using a single lock is to make these MP systems totally compatible with user programs and device drivers written for uniprocessor systems. User programs can take advantage of the extra CPUs, but only one of the CPUs can be executing kernel code at a time.

When code locking is used, there are a fixed number of locks in the system; this number can be used to characterize how much concurrency is available. On a very busy, highly configured system, the code locks are likely to become bottlenecks, so that adding extra processors will not help performance and may actually reduce performance.

Data Locking and Solaris 2

The problem with code locks is that different processors often want to use the same code to work on different data. To allow this use, locks must be placed in data structures rather than in code. Unfortunately, this solution requires an extensive rewrite of the kernel—one reason why Solaris 2 took several years to create. The result is that the kernel has a lot of concurrency available and can be tuned to scale well with large numbers of processors. The same kernel is used on uniprocessors and multiprocessors, so that all device drivers and user programs must be written to work in the same environment and there is no need to constrain concurrency for compatibility with uniprocessor systems. The locks are still needed in a uniprocessor because the kernel can switch between kernel threads at any time to service an interrupt.

With data locking, there is a lock for each instance of a data structure. Since table sizes vary dynamically, the total number of locks grows as the tables grow, and the amount of concurrency that is available to exploit is greater on a very busy, highly configured system. Adding extra processors to such a system is likely to be beneficial. Solaris 2 has hundreds of different data locks; multiplied by the number of instances of the data, there will typically be many thousands of locks in existence on a running system. As Solaris 2 is tuned for more concurrency, some of the remaining code locks are turned into data locks, and “large” locks are broken down into finer-grained locks. There is a trade-off between having lots of mutexes for good MP scalability and few mutexes for reduced CPU overhead on uniprocessors.

Monitoring Solaris 2.6 Lock Statistics

In Solaris 2.6, the lockstat(1) command offers a kernel lock monitoring capability that is very powerful and easy to use. The kernel can dynamically change its locks to start collecting information and return to the normal, highly optimized lock code when the collection is complete. There is some extra overhead while the locks are instrumented, but useful measurements can be taken in a few seconds without disruption of the work on production machines. Data is presented clearly with several options for how it is summarized, but you still must have a very good understanding of Unix kernel architecture and Solaris 2, in particular, to really understand what is happening.

You should read the lockstat(1) manual page to see all its options, but in normal operation, it is run by the root user for the duration of a process for which you want to monitor the effects. If you want to run it for an interval, use the sleep command. I had trouble finding an example that would cause lock contention in Solaris 2.6. Eventually, I found that putting a C shell into an infinite loop, as shown below, generates 120,000 or more system calls per second duplicating and closing file descriptors. When two shells are run at the same time on a dual-CPU system, mpstat shows that hundreds of mutex spins per second occur on each CPU. This test is not intended to be serious! The system call mix was monitored with truss -c, as shown in Figure 10-9. With truss running, the system call rate dropped to about 12,000 calls per second, so I ran lockstat separately from truss.

% while 1 
end

Figure 10-9. System Call Mix for C Shell Loop
% truss -c -p 351
^C
syscall           seconds    calls  errors
open                  .26     2868
close                 .43    17209
getpid                .44    17208
dup                   .31    14341
sigprocmask           .65    28690
                     ----    -----  ------
sys totals:          2.09    80316       0
usr time:             .39
elapsed:             6.51

The intention was to see what system calls were causing the contention, and it appears to be a combination of dup and close, possibly sigprocmask as well. Although there are a lot of calls to getpid, they should not cause any contention.

The example output shows that there were 3,318 adaptive mutex spins in 5 seconds, which is 663 mutex stalls per second in total and which matches the data reported by mpstat. The locks and callers are shown; the top lock seems to be flock_lock, which is a file-related lock. Some other locks that are not named in this display may be members of hashed lock groups. Several of the calling routines mention file or filesystem operations and sigprocmask, which ties up with the system call trace.

If you see high levels of mutex contention, you need to identify both the locks that are being contended on and the component of the workload that is causing the contention, for example, system calls or a network protocol stack. You can then try to change the workload to avoid the problem, and you should make a service call to Sun to report the contention. As Sun pushes multiprocessor scalability to 64 processors and beyond, there are bound to be new workloads and new areas of contention that do not occur on smaller configurations. Each new release of Solaris 2 further reduces contention to allow even higher scalability and new workloads. There is no point reporting high mutex contention levels in anything but the latest release of Solaris 2.

Figure 10-10. Example lockstat Output
# lockstat sleep 5 


Adaptive mutex spin: 3318 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
601 18% 18% 1.00 1 flock_lock cleanlocks+0x10
302 9% 27% 1.00 7 0xf597aab0 dev_get_dev_info+0x4c
251 8% 35% 1.00 1 0xf597aab0 mod_rele_dev_by_major+0x2c
245 7% 42% 1.00 3 0xf597aab0 cdev_size+0x74
160 5% 47% 1.00 7 0xf5b3c738 ddi_prop_search_common+0x50
157 5% 52% 1.00 1 0xf597aab0 ddi_hold_installed_driver+0x2c
141 4% 56% 1.00 2 0xf5c138e8 dnlc_lookup+0x9c
129 4% 60% 1.00 6 0xf5b1a790 ufs_readlink+0x120
128 4% 64% 1.00 1 0xf5c46910 cleanlocks+0x50
118 4% 67% 1.00 7 0xf5b1a790 ufs_readlink+0xbc
117 4% 71% 1.00 6 stable_lock specvp+0x8c
111 3% 74% 1.00 1 0xf5c732a8 spec_open+0x54
107 3% 77% 1.00 3 stable_lock stillreferenced+0x8
97 3% 80% 1.00 1 0xf5b1af00 vn_rele+0x24
92 3% 83% 1.00 19 0xf5c732a8 spec_close+0x104
84 3% 86% 1.00 1 0xf5b0711c sfind+0xc8
57 2% 87% 1.00 1 0xf5c99338 vn_rele+0x24
53 2% 89% 1.00 54 0xf5b5b2f8 sigprocmask+0xa0
46 1% 90% 1.00 1 0xf5b1af00 dnlc_lookup+0x12c
38 1% 91% 1.00 1 0xf5b1af00 lookuppn+0xbc
30 1% 92% 1.00 2 0xf5c18478 dnlc_lookup+0x9c
28 1% 93% 1.00 33 0xf5b5b4a0 sigprocmask+0xa0
24 1% 94% 1.00 4 0xf5c120b8 dnlc_lookup+0x9c
20 1% 95% 1.00 1 0xf5915c58 vn_rele+0x24
19 1% 95% 1.00 1 0xf5af1e58 ufs_lockfs_begin+0x38
16 0% 96% 1.00 1 0xf5915c58 dnlc_lookup+0x12c
16 0% 96% 1.00 1 0xf5c732a8 spec_close+0x7c
15 0% 97% 1.00 1 0xf5915b08 vn_rele+0x24
15 0% 97% 1.00 17 kmem_async_lock kmem_async_dispatch+0xc
15 0% 97% 1.00 1 0xf5af1e58 ufs_lockfs_end+0x24
12 0% 98% 1.00 1 0xf5c99338 dnlc_lookup+0x12c
11 0% 98% 1.00 1 0xf5c8ab10 vn_rele+0x24
9 0% 98% 1.00 1 0xf5b0711c vn_rele+0x24
8 0% 99% 1.00 3 0xf5c0c268 dnlc_lookup+0x9c
8 0% 99% 1.00 5 kmem_async_lock kmem_async_thread+0x290
7 0% 99% 1.00 2 0xf5c11128 dnlc_lookup+0x9c
7 0% 99% 1.00 1 0xf5b1a720 vn_rele+0x24
5 0% 99% 1.00 2 0xf5b5b2f8 clock+0x308
5 0% 100% 1.00 1 0xf5b1a720 dnlc_lookup+0x12c
3 0% 100% 1.00 1 0xf5915b08 dnlc_lookup+0x12c
2 0% 100% 1.00 1 0xf5b5b4a0 clock+0x308
2 0% 100% 1.00 1 0xf5c149b8 dnlc_lookup+0x9c
2 0% 100% 1.00 1 0xf5c8ab10 dnlc_lookup+0x12c
1 0% 100% 1.00 1 0xf5b183d0 esp_poll_loop+0xcc
1 0% 100% 1.00 1 plocks+0x110 polldel+0x28
1 0% 100% 1.00 2 kmem_async_lock kmem_async_thread+0x1ec
1 0% 100% 1.00 6 mml_table+0x18 srmmu_mlist_enter+0x18
1 0% 100% 1.00 3 mml_table+0x10 srmmu_mlist_enter+0x18
--------------------------------------------------------------------------


Adaptive mutex block: 24 events

Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
3 12% 12% 1.00 90333 0xf5b3c738 ddi_prop_search_common+0x50
3 12% 25% 1.00 235500 flock_lock cleanlocks+0x10
2 8% 33% 1.00 112250 0xf5c138e8 dnlc_lookup+0x9c
2 8% 42% 1.00 281250 stable_lock specvp+0x8c
2 8% 50% 1.00 107000 0xf597aab0 ddi_hold_installed_driver+0x2c
2 8% 58% 1.00 89750 0xf5b5b2f8 clock+0x308
1 4% 63% 1.00 92000 0xf5c120b8 dnlc_lookup+0x9c
1 4% 67% 1.00 174000 0xf5b0711c sfind+0xc8
1 4% 71% 1.00 238500 stable_lock stillreferenced+0x8
1 4% 75% 1.00 143500 0xf5b17ab8 esp_poll_loop+0x8c
1 4% 79% 1.00 44500 0xf5b183d0 esp_poll_loop+0xcc
1 4% 83% 1.00 81000 0xf5c18478 dnlc_lookup+0x9c
1 4% 88% 1.00 413000 0xf5c0c268 dnlc_lookup+0x9c
1 4% 92% 1.00 1167000 0xf5c46910 cleanlocks+0x50
1 4% 96% 1.00 98500 0xf5c732a8 spec_close+0x7c
1 4% 100% 1.00 94000 0xf5b5b4a0 clock+0x308
--------------------------------------------------------------------------


Spin lock spin: 254 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
48 19% 19% 1.00 3 atomic_lock+0x29e rw_exit+0x34
45 18% 37% 1.00 25 atomic_lock+0x24e rw_exit+0x34
44 17% 54% 1.00 1 atomic_lock+0x29d rw_exit+0x34
35 14% 68% 1.00 1 atomic_lock+0x24e rw_enter+0x34
32 13% 80% 1.00 19 cpus+0x44 disp+0x78
22 9% 89% 1.00 1 atomic_lock+0x29d rw_enter+0x34
17 7% 96% 1.00 52 cpu[2]+0x44 disp+0x78
6 2% 98% 1.00 10 cpu[2]+0x44 setbackdq+0x15c
5 2% 100% 1.00 1 atomic_lock+0x29e rw_enter+0x34
--------------------------------------------------------------------------


Thread lock spin: 3 events

Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
3 100% 100% 1.00 19 0xf68ecf5b ts_tick+0x8
-------------------------------------------------------------------------------


Multiprocessor Hardware Configurations

At any point in time, there exist CPU designs that represent the best performance that can be obtained with current technology at a reasonable price. The cost and technical difficulty of pushing the technology further means that the most cost-effective way of increasing computer power is to use several processors. There have been many attempts to harness multiple CPUs and, today, there are many different machines on the market. Software has been the problem for these machines. It is hard to design software that will be portable across a large range of machines and that will scale up well when large numbers of CPUs are configured. Over many years, Sun has shipped high unit volumes of large-scale multiprocessor machines and has built up a large portfolio of well-tuned applications that are optimized for the Sun architecture.

The most common applications that can use multiprocessors are time-shared systems and database servers. These have been joined by multithreaded versions of CPU-intensive programs like MCAD finite element solvers, EDA circuit simulation packages, and other High Performance Computing (HPC) applications. For graphics-intensive applications where the X server uses a lot of the CPU power on a desktop machine, configuring a dual-CPU system will help—the X protocol allows buffering and batched commands, with few commands needing acknowledgments. In most cases, the application sends X commands to the X server and continues to run without waiting for them to complete. One CPU runs the application, and the other runs the X server and drives the display hardware.

Two classes of multiprocessor machines have some possibility of software compatibility, and both have SPARC-based implementations. For illustrations of these two classes, see Figure 10-11 and Figure 10-12.

Figure 10-11. Typical Distributed Memory Multiprocessor with Mesh Network


Figure 10-12. Typical Small-Scale, Shared Memory Multiprocessor


Distributed Memory Multiprocessors

Distributed memory multiprocessors, also known as Massively Parallel Processors (MPP), can be thought of as a network of uniprocessors. Each processor has its own memory, and data must be explicitly copied over the network to another processor before it can be used. The benefit of this approach is that there is no contention for memory bandwidth to limit the number of processors. Moreover, if the network is made up of point-to-point links, then the network throughput increases as the number of processors increases. There is no theoretical limit to the number of processors that can be used in a system of this type, but there are problems with the high latency of the point-to-point links, and it is hard to find algorithms that scale with large numbers of processors.

Shared Memory Multiprocessors

A shared memory multiprocessor is much more tightly integrated than a distributed memory multiprocessor and consists of a fairly conventional starting point of CPU, memory, and I/O subsystem, with extra CPUs added onto the central bus.

This configuration multiplies the load on the memory system by the number of processors, and the shared bus becomes a bottleneck. To reduce the load, caches are always used and very fast memory systems and buses are built. If more and faster processors are added to the design, the cache size must increase, the memory system must be improved, and the bus throughput must be increased. Most small-scale MP machines support up to four processors. Larger ones support a few tens of processors. With some workloads, the bus or memory system will saturate before the maximum number of processors has been configured.

Special circuitry snoops activity on the bus at all times so that all the caches can be kept coherent. If the current copy of some data is in more than one of the caches, then it will be marked as being shared. If it is updated in one cache, then the copies in the other caches are either invalidated or updated automatically. From a software point of view, there is never any need to explicitly copy data from one processor to another, and shared memory locations are used to communicate values among CPUs. The cache-to-cache transfers still occur when two CPUs use data in the same cache line, so such transfers must be considered from a performance point of view, but the software does not have to worry about it.

There are many examples of shared memory mainframes and minicomputers. SPARC-based Unix multiprocessors range from the dual-processor Ultra 2 and the 4-processor E450, to the 30-processor E6000 and the 64-processor E10000.

The high-end machines normally have multiple I/O subsystems and multiple memory subsystems connected to a much wider central bus, allowing more CPUs to be configured without causing bottlenecks in the I/O and memory systems. The E10000 goes one step further by using a crossbar switch as its interconnect. Multiple transfers occur between CPU, memory, and I/O on separate point-to-point data paths that all transfer data concurrently. This mechanism provides higher throughput, low contention, and low latency.

An unavoidable fact of life is that the memory latency increases as more CPUs are handled by a system design. Even with the same CPU module configuration, an Ultra 2 or E450 has lower main memory latency than does an E6000, which in turn has lower latency than an E10000. The performance of a single uniprocessor job is therefore higher on a smaller machine. The latency of a system increases at high load levels, and this load-dependent increase is more of an issue on a smaller machine. When comparing the E6000 and E10000, it seems that the E6000 has an advantage up to 20 CPUs, and the E10000 is usually faster for loads that consume 24–28 CPUs. The E10000 is the only option from 30–64 CPUs, and it maintains a lower latency under heavy load than any other system. So, from purely a performance point of view, it is far faster to have a row of four-CPU E450s than to use an E10000 that is divided into many separate four-CPU domains.

Clusters of Shared Memory Machines

In any situation, the most efficient configuration is to use the smallest number of the fastest CPUs communicating over the lowest latency interconnect. This configuration minimizes contention and maximizes scalability. It is often necessary to compromise for practical or cost reasons, but it is clear that the best approach to building a large-scale system is to use the largest shared memory systems you can get, then cluster them together with the smallest number of nodes. The performance of a cluster is often limited by activities that are too hard to distribute, and so performance degrades to the performance of a single node for some of the time. If that node is as large as possible, then the overall efficiency is much higher.

Shared/Distributed Memory Multiprocessors

Although the two types of MP machines started out as very different approaches, the most recent examples of these types have converged.

The shared memory machines all have large caches for each CPU. For good performance, it is essential for the working set of data and instructions to be present in the cache. The cache is now analogous to the private memory of a distributed memory machine, and many algorithms that have been developed to partition data for a distributed memory machine are applicable to a cache-based, shared memory machine.

The latest distributed memory machines have moved from a software- or DMA-driven point-to-point link to an MMU-driven, cache-line-based implementation of distributed shared memory. With this system, the hardware is instructed to share a region of address space with another processor, and cache coherence is maintained over a point-to-point link. As the speed of these links approaches the speed of a shared memory backplane, the distributed memory system can be used to run shared memory algorithms more efficiently. Link speeds are comparable to the low end of shared memory backplane speeds, although they typically have much higher latency. The extra latency is due to the physical distances involved and the use of complex switching systems to interconnect large numbers of processors. Large distributed memory systems have higher latency than do smaller ones.

NUMA Multiprocessors

The hybrid that has emerged is known as nonuniform memory access, or NUMA. NUMAs are shared memory multiprocessor machines where the memory latency varies. If the access is to memory in the local node, then access is fast; if the access request has to travel over one or more communication links, then access becomes progressively slower. The node itself is often a single-board SMP with two to four CPUs. The SGI Origin 2000 system uses a twin-CPU node; the various Intel Pentium-based systems from Sequent, Data General, and others use a 4-CPU node. However, it makes far more sense when building NUMA systems to use an entire high-end server as a node, with 20 or more CPUs. The bulk of high-end server system requirements are for 10 to 30 or so CPUs, and customers like to buy a half-configured system to allow for future expansion. For Sun this requirement can be satisfied by an SMP such as the E6000 or E10000, so Sun had no need to develop a NUMA-based system for the 1996–1998 generation. NUMA-based competitors must configure multiple nodes to satisfy today’s requirements, and users of these systems are suffering from the new problems that come with this technology.

Benchmarks such as the TPC-D data warehouse test—see http://www.tpc.org—have shown that Sun’s SMP machines do beat the NUMA machines on price and performance over the whole range of database sizes. This benchmark is ideally suited to NUMA because the query structures are well understood and the data layout can be optimized to keep the NUMA nodes well balanced. In real life, data warehouse queries are often constructed ad hoc, and it is not possible to optimize data placement in advance. The TPC-D benchmark is substantially overstating the performance of NUMA machines as data warehouse engines because of this effect. Data placement is not an issue in one large SMP node. In general, the workload on a system is not constant; it can vary from one query to the next or from one day to the next. In a benchmark situation, a NUMA machine can be optimized for the fixed benchmark workload but would need to be continuously reoptimized for real-world workloads.

A technique used to speed up NUMA systems is data replication. The same data can be replicated in every node so that it is always available at a short memory latency. This technique has been overused by some vendors in benchmarks, with hundreds of Mbytes of data duplicated over all the nodes. The effect is that a large amount of RAM is wasted because there are many nodes over which to duplicate the data. In a 32-CPU system made up of 2-way or 4-way nodes, you might need to configure extra memory to hold 16 or 8 copies of the data. On a 32-CPU SMP system, there is a single copy. If there are any writes to this data, the NUMA machines must synchronize all their writes, a significant extra load.

For all these reasons, the best MPP or NUMA configuration should be built by clustering the smallest number of the largest possible SMP nodes.

CPU Caches

Performance can vary even among machines that have the same CPU and clock rate if they have different memory systems. The interaction between caches and algorithms can cause performance problems, and an algorithm change may be called for to work around the problem. This section provides information on the CPU caches and memory management unit designs of various SPARC platforms so that application developers can understand the implications of differing memory systems[1]. This section may also be useful to compiler writers.

[1] High Performance Computing by Kevin Dowd covers this subject very well.

CPU Cache History

Historically, the memory systems on Sun machines provided equal access times for all memory locations. This was true on the Sun-2™ machines and was true for data accesses on the Sun-3/50™ and Sun-3/60 machines.

Equal access was achieved by running the entire memory system as fast as the processor could access it. On Motorola 68010 and 68020 processors, four clock cycles were required for each memory access. On a 10 MHz Sun-2, the memory ran at a 400 ns cycle time; on a 20 MHz Sun-3/60, it ran at a 200 ns cycle time, so the CPU never had to wait for memory to be ready.

SPARC processors run at higher clock rates and want to access memory in a single cycle. Main memory cycle times have improved a little, and very wide memory systems can transfer a 32- or 64-byte block almost as quickly as a single word. The CPU cache is designed to cope with the mismatch between the way DRAM can handle blocks of memory efficiently and the way that the CPU wants to access one word at a time at much higher speed.

The CPU cache is an area of fast memory made from static RAM, or SRAM. The cache transfers data from main DRAM memory using page mode in a block of 16, 32, or 64 bytes at a time. Hardware keeps track of the data in the cache and copies data into the cache when required.

More advanced caches have multiple levels and can be split so that instructions and data use separate caches. The UltraSPARC CPU has a 16-Kbyte instruction cache and separate 16-Kbyte data cache, which are loaded from a second-level 512-Kbyte to 4-Mbyte combined cache that loads from the memory system. We first examine simple caches before we look at the implications of more advanced configurations.

Cache Line and Size Effects

A SPARCstation 2 is a very simple design with clearly predictable behavior. More recent CPU designs are extremely complicated, so I’ll start with a simple example. The SPARCstation 2 has its 64 Kbytes of cache organized as 2048 blocks of 32 bytes each. When the CPU accesses an address, the cache controller checks to see if the right data is in the cache; if it is, then the CPU loads it without stalling. If the data needs to be fetched from main memory, then the CPU clock is effectively stopped for 24 or 25 cycles while the cache block is loaded. The implications for performance tuning are obvious. If your application is accessing memory very widely and its accesses are missing the cache rather than hitting the cache, then the CPU may spend a lot of its time stopped. By changing the application, you may be able to improve the hit rate and achieve a worthwhile performance gain. The 25-cycle delay is known as the miss cost, and the effective performance of an application is reduced by a large miss cost and a low hit rate. The effect of context switches is to further reduce the cache hit rate because after a context switch the contents of the cache will need to be replaced with instructions and data for the new process.

Applications access memory on almost every cycle. Most instructions take a single cycle to execute, and the instructions must be read from memory; data accesses typically occur on 20–30 percent of the cycles. The effect of changes in hit rate for a 25-cycle miss cost is shown in Table 10-1 and Figure 10-13 in both tabular and graphical forms. A 25-cycle miss cost implies that a hit takes 1 cycle and a miss takes 26 cycles.

Figure 10-13. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost


Table 10-1. Application Speed Changes as Hit Rate Varies with a 25-Cycle Miss Cost
Hit Rate    Hit Time    Miss Time    Total Time    Performance
100%        100%        0%           100%          100%
99%         99%         26%          125%          80%
98%         98%         52%          150%          66%
96%         96%         104%         200%          50%
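
The arithmetic behind the table is simple; as a sketch in C, using the hit and miss costs stated above:

/* total time relative to a 100% hit rate: 1-cycle hit, 26-cycle miss */
double relative_time(double hit_rate)      /* hit_rate between 0.0 and 1.0 */
{
        return hit_rate * 1.0 + (1.0 - hit_rate) * 26.0;  /* 0.96 -> 2.0 = 200% */
}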

The execution time increases dramatically as the hit rate drops. Although a 96 percent hit rate sounds quite high, you can see that the program will be running at half speed. Many small benchmarks like Dhrystone run at a 100 percent hit rate; real applications such as databases are moving data around and following linked lists and so have a poor hit rate. It isn’t that difficult to do things that are bad for the cache, so it is a common cause of reduced performance.

A complex processor such as UltraSPARC I does not have fixed cache miss costs; the cost depends upon the exact instruction sequencing, and in many cases the effective miss cost is reduced by complex hardware pipelining and compiler optimizations. The on-chip, 16-Kbyte data cache takes 1 cycle, the external cache between 1 and 7 clock cycles (it can supply a word of data in every cycle but may stall the CPU for up to 7 cycles if the data is used immediately), and main memory between 40 and 50 cycles, depending upon the CPU and system clock rates. With the cache hit rate varying in a two-level cache, we can’t draw a simple performance curve, but relatively, performance is better than the curve shown in Figure 10-13 when the system is running inside the large second-level cache, and worse when running from main memory.

A Problem with Linked Lists

Applications that traverse large linked lists cause problems with caches. Let’s look at this in detail on a simple SPARCstation 2 architecture and assume that the list has 5,000 entries. Each block on the list contains some data and a link to the next block. If we assume that the link is located at the start of the block and that the data is in the middle of a 100-byte block, as shown in Figure 10-14, then we can deduce the effect on the memory system of chaining down the list.

Figure 10-14. Linked List Example


The code to perform the search is a tight loop, shown in Figure 10-15. This code fits in seven words, one or two cache lines at worst, so the cache is working well for code accesses. Data accesses occur when the link and data locations are read. If the code is simply looking for a particular data value, then these data accesses will occur every few cycles. They will never be in the same cache line, so every data access will cause a 25-cycle miss that reads in 32 bytes of data when only 4 bytes were wanted. Also, with a 64-Kbyte cache, only 2,048 cache lines are available, so after 1,024 blocks have been read in, the cache lines must be reused. This means that a second attempt to search the list will find that the start of the list is no longer in the cache.

Figure 10-15. Linked List Search Code in C
/* C code to search a linked list for a value and miss the cache a lot */ 

struct block {
        struct block *link;             /* link to next block */
        int pad1[11];
        int data;                       /* data item to check for */
        int pad2[12];
} blocks[5000];

struct block *find(pb, value)
        struct block *pb;
        int value;
{
        while (pb)                      /* check for end of linked list */
        {
                if (pb->data == value)  /* check for value match */
                        return pb;      /* return matching block */
                pb = pb->link;          /* follow link to next block */
        }
        return (struct block *)0;       /* return null if no match */
}

The only solution to this problem is an algorithmic change. The problem will occur on any of the current generation of computer systems. In fact, the problem gets worse as the processor gets faster since the miss cost tends to increase because of the difference in speed between the CPU and cache clock rate and the main memory speed. With more recent processors, it may be possible to cache a larger linked list in external cache, but the on-chip cache is smaller than 64 Kbytes. What may happen is that the first time through the list, each cache miss costs 50 cycles, and on subsequent passes it costs 8 cycles.

The while loop compiles with optimization to just seven instructions, including two loads, two tests, two branches, and a no-op, as shown in Figure 10-16. Note that the instruction after a branch is always executed on a SPARC processor. This loop executes in 9 cycles (225 ns) on a SPARCstation 2 if it hits the cache, and in 59 cycles (1,475 ns) if both loads miss. On a 200 MHz UltraSPARC, the loop would run in approximately 5 cycles (25 ns) if it hit the on-chip cache, 20 cycles (100 ns) in the off-chip cache, and 100 cycles (500 ns) from main memory the first time through.

Figure 10-16. Linked List Search Loop in Assembler
LY1:                            /* loop setup code omitted */
        cmp     %o2,%o1         /* see if data == value */
        be      L77016          /* and exit loop if matched */
        nop                     /* pad branch delay slot */
        ld      [%o0],%o0       /* follow link to next block */
        tst     %o0             /* check for end of linked list */
        bne,a   LY1             /* branch back to start of loop */
        ld      [%o0+48],%o2    /* load data in branch delay slot */

The memory latency test in Larry McVoy’s lmbench set of benchmarks is based on this situation. It gives a worst-case situation of unrelated back-to-back loads but is very common in real application code. See also “The Cache-Aligned Block Copy Problem” on page 265 for a description of another cache problem.

For detailed information on specific cache architectures, see “SPARC CPU Cache Architectures” on page 258. Chapter 12, “Caches,” on page 299 provides a more detailed explanation of the way in which caches in general work.