Chapter 12. Caches

Caches store a small amount of information for repeated fast access. Many types of caches occur throughout computer hardware and software systems. We’ll take a look at the common principles, with examples of CPU cache, directory name lookup cache (DNLC), and name service cache. Then, we’ll discuss some caches used with local disks and caches used with networks.

Cache Principles

Caches work on two basic principles that should be quite familiar to you from everyday life. The first is that if you spend a long time going to get something and you think you may need it again soon, you keep it nearby. For example, if you are working on your car, you find a 15 mm spanner in your tool box, then crawl under the car. If you then need a 10 mm spanner, you don’t put the 15 mm back in the tool box, you leave it under the car. When you have finished, there is a cache of tools in a pile under the car that make up your working set. If you allowed only one tool at a time under the car, you would waste a lot of time going back to the tool box. However, if the pile of tools gets too large, you may decide to put away a few tools that you think you won’t need any more.

The second principle is that when you get something, you can save time by getting a few extra things that you think you might need soon. In the car example, you could grab a handful of spanners each time, thus saving yourself from having to make a few trips. You find that the pile under the car gets bigger much more quickly, you never use some tools, and it takes you longer to pick up and put down a handful of tools on each trip to the tool box.

The first principle, called “temporal locality,” depends on reusing the same things over time. The second principle, called “spatial locality,” depends on using things at the same time that are located near each other. Caches work well only if there is good locality in what you are doing.
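
As a minimal C sketch of the idea (an illustration added here; the array size is arbitrary), both functions below compute the same sum. The first walks memory in the order it is laid out and makes full use of every cache line it fetches; the second strides across rows and gets little benefit from spatial locality.

#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];

/* Good spatial locality: consecutive accesses touch consecutive memory,
   so each cache line that is fetched is fully used before it is evicted. */
double sum_row_major(void)
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Poor spatial locality: each access jumps a whole row ahead and lands
   on a different cache line, so most of each fetched line is wasted. */
double sum_col_major(void)
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}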

Another good example is the “cache” of groceries that you keep in your house that saves you a trip to the supermarket every time you want a bite to eat. To take this analogy a bit further, let’s say you are the kind of person who hates shopping and lives off tinned and frozen food that you buy once a month. You have a very efficient cache, with few shopping trips, and you can spend all your spare time browsing the Internet. Now, let’s say that your mother comes to stay and tells you that you need a diet of fresh fruit and salad instead. You can’t buy in bulk as these items don’t freeze or keep, and you need to visit the shops almost every day. You now waste all your spare time shopping. You notice one day that your local shop is now also on the Internet, and you can use your browser to place orders for delivery. Now you can surf the Internet all the time, keep a good diet, and never go grocery shopping again! In cache terms, this is an asynchronous prefetch of stuff that you know you will need soon, but, by requesting it in advance, you never have to stop and wait for it to arrive.

This analogy illustrates that some sequences of behavior work very efficiently with a cache, and others make little or no use of the cache. In some cases, “cache busting” behavior can be fixed by changing the system to work in a different way that bypasses the normal cache fetch. These alternative methods usually require extra hardware or software support in a computer system.

Before going on to a specific kind of cache, let’s look at some generic terms and measurements that parameterize caches. When the cache is used to hold some kind of information, a read from the cache is very different from writing new information, so we’ll take a look at cache reads first, then look at writes.

Cache Reads That Succeed

Many kinds of information can be cached: the value of a memory location, the location of a file, or the network address of another system. When the information is needed, the system first looks for it in the cache. This approach requires that each item in the cache has a unique name or tag associated with the information. When a cache lookup occurs, the first thing that happens is a search during which the system tries to match the name you gave it with one of the tags. If the search succeeds, the associated data is returned; this case is called a read hit. This situation is the best case, as it is both quick and successful. The two measures of interest are the read hit rate and the read hit cost. The read hit rate is the number of read hits that occur per second; the read hit cost is the time it takes to search the cache and return the data.
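
The same vocabulary can be written down as a toy C sketch (illustration only; the fixed-size table and linear search stand in for whatever hashed structure a real cache would use). A lookup that matches a tag is a read hit; anything else counts as a read miss, and the caller has to fetch the data the slow way.

#include <string.h>

#define CACHE_SLOTS 256

/* A generic cache entry: a tag that names the item plus the cached data. */
struct entry {
    char tag[32];                /* unique name, empty string if the slot is unused */
    int  data;                   /* whatever information is being cached */
};

static struct entry cache[CACHE_SLOTS];
static unsigned long read_hits, read_misses;

/* Search the cache for "tag"; return 1 and fill in *data on a read hit.
   read_hits counted per second gives the read hit rate; the time spent
   in this function is the read hit cost. */
int cache_lookup(const char *tag, int *data)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (strcmp(cache[i].tag, tag) == 0) {
            *data = cache[i].data;
            read_hits++;
            return 1;            /* read hit */
        }
    }
    read_misses++;               /* read miss: caller must fetch the real data */
    return 0;
}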

CPU Cache Example

For a CPU’s primary data cache, an address is used as the tag. Millions of read hits occur each second; the cost is known as the load use delay, a constant small number of CPU clock cycles. When the CPU issues a load instruction to move data from memory to a register, it has to wait a few cycles before it can use the register. A good compiler will try to optimize code by moving load instructions forward and filling in the load use delay with other work. It is hard to measure (and understand) what is going on in the CPU cache.

Directory Name Lookup Cache Example

For the directory name lookup cache (DNLC), an open system call causes a pathname string, e.g., “export,” to be looked up to find its inode. The rate at which this lookup occurs is reported by sar -a as namei/s; it can reach a few thousand per second on a busy file server. The read hit cost is relatively low and consumes only CPU time. An efficient hashed search is used, so the size of the search is not an issue.

Name Service Cache Example

The name service cache is implemented as a multithreaded daemon process (nscd). It caches password and group information for users, and the host IP address information for systems. For example, a call to gethostbyname(3N) would ordinarily involve a search through the local /etc/hosts file, a call to a local NIS or NIS+ server, and perhaps a DNS lookup, as specified in /etc/nsswitch.conf. The nscd provides a systemwide cache for the results of these searches, and gethostbyname accesses it through a very fast, new, interprocess call mechanism known as a door call. The read hit cost is only a few tens of microseconds, depending on your CPU speed.
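
A quick way to see the nscd at work is to time the same lookup twice from a small C program (a sketch only; the hostname is just an example, and the exact timings depend on your name service configuration). The first call may have to go out to NIS or DNS; the second should be a read hit in the nscd and return in tens of microseconds.

#include <netdb.h>
#include <stdio.h>
#include <sys/time.h>

/* Return the elapsed time in microseconds for one gethostbyname call. */
static long lookup_usec(const char *name)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    (void) gethostbyname(name);
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
}

int main(void)
{
    printf("first lookup:  %ld us\n", lookup_usec("www.sun.com"));  /* likely a miss */
    printf("second lookup: %ld us\n", lookup_usec("www.sun.com"));  /* nscd read hit */
    return 0;
}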

Cache Reads That Miss the Cache

If the search fails and the data is not in the cache, the case is called a read miss. The rate at which this situation occurs is called the read miss rate, and the time taken is called the read miss cost. A read miss corresponds to the work that would be needed if the cache did not exist, plus the added overhead of checking the cache. Read misses are slow, and the read miss cost can be unpredictable, although it is usually possible to figure out the best case, the worst case, and the average.

CPU Cache Example

CPU caches come in many configurations. The most common setup is to have two separate primary caches built into the CPU chip, one to hold instructions and one to hold data, often about 16 Kbytes each. These caches then refer to an external level-2 cache holding between 256 Kbytes and 4 Mbytes of data. A read miss in the small primary cache may result in a hit or a miss in the level-2 cache. A level-2 hit takes a few cycles longer than a level-1 hit but is still very efficient. A level-2 miss involves a main memory reference, and in a multiprocessor system, the system must access the backplane and check every CPU cache in the system as well. If the backplane is already very busy, there may be a considerable extra delay. This explains why heavily loaded multiprocessor servers run faster and scale better with the reduced number of read misses that come with larger CPU caches. For example, Sun currently offers a choice of 512-Kbyte, 1-Mbyte, 2-Mbyte, and 4-Mbyte caches on the UltraSPARC range. The 4-Mbyte cache choice is always faster, but it has a proportionately greater benefit on the performance and scalability of a 24-CPU E6000 than on a 4-CPU E3000.

An anomaly can arise in some cases. It takes time to search the cache and to load new data into it, and these processes slow down the fetch operation. If you normally operate in a cache-busting mode, you actually go faster if there is no cache at all. Since caches add to the cost of a system (extra memory used or extra hardware), they may not be present on a simpler or cheaper system. The anomaly is that a cheaper system may be faster than a more expensive one for some kinds of work. The microSPARC processor used in the 110 MHz SPARCstation 5 is very simple: it has a high cache miss rate but a low cache miss cost. The SuperSPARC processor used in the SPARCstation 20 has much bigger caches, a far lower cache miss rate, but a much higher cache miss cost. For a commercial database workload, the large cache works well, and the SPARCstation 20 is much faster. For a large Verilog EDA simulation, the caches are all too small to help, and the low latency to access memory makes the SPARCstation 5 an extremely fast machine. The lessons learned from this experience were incorporated in the UltraSPARC design, which also has very low latency to access memory. To get the low latency, the latest, fastest cache memory and very fast, wide system buses are needed.

Directory Name Lookup Cache Example

If we consider our other examples, the DNLC is caching a file name. If the system cannot find the corresponding inode number in the DNLC, it has to read through the directory structures of the file system. This procedure may involve a linear search through a UFS directory structure or a series of readdir calls over NFS to an NFS server. There is some additional caching of blocks of directory data and NFS attributes that may save time, but often the search has to sleep for many milliseconds while waiting for several disk reads or network requests to complete. You can monitor the number of directory blocks read per second with sar -a as dirbk/s; you can also look at the number of NFS2 readdir calls and NFS3 readdirplus calls. NFS2 reads a single entry with each readdir call, whereas NFS3 adds the readdirplus call that reads a series of entries in one go for greater efficiency (but longer latency).

To summarize the effects of a DNLC miss: a little extra CPU is consumed during the search, but the miss cost is dominated by time spent sleeping, waiting for the disk and network. DNLC hit rates (the number of read hits as a proportion of all lookups) are often in the 80%–100% range. If you run the find command, which walks through the directory structure, you will discover that the hit rate drops to perhaps 30%–40%. My rule for the DNLC hit rate is a bit too simple. It tells you to increase the DNLC size if it is fairly small, the hit rate drops below 80%, and the reference rate is above 100/s. In some cases, a lot of new files are being created or a find-like operation is needed, and there is no point in having a big DNLC.

Name Service Cache Example

A read miss on the name service cache causes a sequence of lookups by the nscd; the sequence is controlled by the /etc/nsswitch.conf file. If the file contains the entry:

hosts: files nisplus dns

then the first nscd lookup uses the standard files “backend” to look in /etc/hosts. In recent releases of Solaris 2, the files backend itself caches the file and does a hashed lookup into it, so very long host and passwd files are searched efficiently, and the read miss could be quite quick and CPU bound with no waiting. A NIS or NIS+ lookup involves a call over the network to a name server and so incurs at least a few milliseconds of delay. If the lookup is eventually passed up a hierarchy of DNS servers across the Internet, the lookup could take several seconds. The worst case is a lookup in a remote DNS domain for a hostname that does not exist. Unfortunately, some applications repeat the same lookup several times in quick succession, so a novel feature of the nscd is that it supports negative caching. It remembers that a lookup failed and tells you immediately that the host does not exist if you ask a second time. There is a five-second time-out for negative entries by default. The full output from nscd -g (on a recently booted Solaris 2.5 system) is shown in Figure 12-1.

Figure 12-1. Example Name Service Cache Daemon (nscd) Configuration
% nscd -g 
nscd configuration:

0 server debug level
"/dev/null" is server log file

passwd cache:

Yes cache is enabled
507 cache hits on positive entries
0 cache hits on negative entries
55 cache misses on positive entries
2 cache misses on negative entries
89% cache hit rate
0 queries deferred
16 total entries
211 suggested size
600 seconds time to live for positive entries
5 seconds time to live for negative entries
20 most active entries to be kept valid
Yes check /etc/{passwd,group,hosts} file for changes
No use possibly stale data rather than waiting for refresh

group cache:

Yes cache is enabled
27 cache hits on positive entries
0 cache hits on negative entries
11 cache misses on positive entries
0 cache misses on negative entries
71% cache hit rate
0 queries deferred
5 total entries
211 suggested size
3600 seconds time to live for positive entries
5 seconds time to live for negative entries
20 most active entries to be kept valid
Yes check /etc/{passwd,group,hosts} file for changes
No use possibly stale data rather than waiting for refresh

hosts cache:

Yes cache is enabled
22 cache hits on positive entries
3 cache hits on negative entries
7 cache misses on positive entries
3 cache misses on negative entries
71% cache hit rate
0 queries deferred
4 total entries
211 suggested size
3600 seconds time to live for positive entries
5 seconds time to live for negative entries
20 most active entries to be kept valid
Yes check /etc/{passwd,group,hosts} file for changes
No use possibly stale data rather than waiting for refresh


Cache Replacement Policies

So far, we have assumed that there is always some spare room in the cache. In practice, the cache will fill up and at some point cache entries will need to be reused. The cache replacement policy varies, but the general principle is that you want to get rid of entries that you will not need again soon. Unfortunately, the “won’t need soon” cache policy is hard to implement unless you have a very structured and predictable workload or you can predict the future! In its place, you will often find caches that use “least recently used” (LRU) or “not recently used” (NRU) policies, in the hope that your accesses have good temporal locality. CPU caches tend to use “direct mapping,” where there is no choice: the address in memory is used to fix an address in the cache. For small, on-chip caches, there may be “n-way associative mapping,” where there are “n” choices of location for each cache line, usually two or four, with an LRU or random choice implemented in hardware.
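
As a sketch of what “direct mapping” means, the C fragment below (an illustration only; the 16-Kbyte size and 32-byte line are chosen to resemble the sort of primary cache described earlier, not any specific CPU) shows how the address alone selects the cache line, so the “replacement policy” is simply to overwrite whatever was there.

#include <stdint.h>

#define LINE_SIZE   32
#define CACHE_LINES (16 * 1024 / LINE_SIZE)   /* a 16-Kbyte, direct-mapped cache */

struct line {
    uint32_t tag;                /* high-order address bits of the cached block */
    int      valid;
};

static struct line cache[CACHE_LINES];

/* Probe the cache for an address; refill the line on a miss. */
int cache_access(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % CACHE_LINES;   /* fixed by the address */
    uint32_t tag   = addr / (LINE_SIZE * CACHE_LINES);

    if (cache[index].valid && cache[index].tag == tag)
        return 1;                /* read hit */
    cache[index].valid = 1;      /* direct mapped: no replacement choice to make */
    cache[index].tag   = tag;
    return 0;                    /* read miss, line refilled */
}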

Random replacement (RR) policies seem to offend some of the engineering purists among us, as they can never work optimally, but I like them because the converse is true, and they rarely work badly either! Caches work well if the access pattern is somewhat random with some locality or if it has been carefully constructed to work with a particular cache policy (an example is SunSoft Performance WorkShop’s highly tuned math library). Unfortunately, many workloads have structured access patterns that work against the normal policies. Random replacement policies can be fast, don’t need any extra storage to remember the usage history, and can avoid some nasty interactions.

Cache Writes and Purges

So far, we have looked only at what is involved in reading existing information through a cache. When new information is created or old information is overwritten or deleted, there are extra problems to deal with.

The first issue to consider is that a memory-based cache contains a copy of some information and the official master copy is kept somewhere else. When we want to change the information, we have several choices. We could update the copy in the cache only, which is very fast, but that copy now differs from the master copy. We could update both copies immediately, which may be slow, or we could throw away the cached copy and just update the master (if the cache write cost is high).
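
The three choices can be written out as a small C sketch (invented structures for illustration only; master_update stands in for the slow write to wherever the master copy lives).

static int master_copy;                       /* stands in for the master copy */
static void master_update(int value) { master_copy = value; }

struct cached {
    int data;
    int valid;                                /* cached copy can be used */
    int dirty;                                /* cached copy differs from the master */
};

/* Write-back: update only the cache. Fast, but the master is now stale
   and the dirty entry must be written back before it can be reused. */
void write_back(struct cached *c, int value)
{
    c->data  = value;
    c->dirty = 1;
}

/* Write-through: update both copies immediately. Consistent, but slow. */
void write_through(struct cached *c, int value)
{
    c->data = value;
    master_update(value);
}

/* Invalidate: throw the cached copy away and update only the master. */
void write_invalidate(struct cached *c, int value)
{
    c->valid = 0;
    master_update(value);
}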

Another issue to consider is that the cache may contain a large block of information and we want to update only a small part of it. Do we have to copy the whole block back to update the master copy? And what if there are several caches for the same data, as in a CPU with several levels of cache and other CPUs in a multiprocessor? There are many possibilities, so I’ll just look at the examples we have discussed already.

The CPU cache is optimized for speed by hardware that implements relatively simple functions with a lot of the operations overlapped so they all happen at once. On an UltraSPARC, a cache block contains 64 bytes of data. A write will change between 1 and 8 bytes at a time. Each block contains some extra flags that indicate whether the data is shared with another CPU cache and whether the data has been written to. A write updates both the internal level-1 and the external level-2 caches. It is not written out to memory, but the first time a shared block in a multiprocessor system is written to, a special signal, invalidating any copies of the data that other CPUs hold, is sent to all the other CPUs. The flags are then changed from shared/clean to private/dirty. If another CPU then tries to reread that data, it is provided directly in a cache-to-cache transfer and becomes shared/dirty. The data is eventually written back to main memory only when that cache block is needed to hold different data by a read. In this case, the read operation is held in an input buffer while the dirty cache block is copied to an output buffer, then the read from memory completes, and finally the dirty cache line write completes.

Another situation to consider is the case of a context switch or a process exiting. If an address space is no longer valid, the cache tags may also be invalid. For systems that store a virtual address in the cache tag, all the invalid addresses must be purged; for systems that store only a physical address in the cache tag, there is no such need. Older SPARC systems, the HyperSPARC range, and some other systems such as HP-PA use virtual caches. SuperSPARC and UltraSPARC use physical caches. In general, a virtual cache can be faster for uniprocessor systems and scientific workloads, and a physical cache scales much better on multiprocessor systems and commercial workloads that have lots of context switches.

There are a lot more ways to implement caches. If you are interested, I recommend you read Computer Architecture: A Quantitative Approach by Hennessy and Patterson, published by Morgan Kaufmann.

Returning to our consideration of the DNLC: when a file is created or deleted, the new directory entry is written to disk or sent over NFS immediately, so it is a slow operation. The extra overhead of updating or purging the DNLC is very small in comparison.

The name service cache does not get involved in updates. If someone changes data in the name service, the nscd relies on time-outs and watching the modification dates on files to register the change.

Cache Efficiency

A cache works well if there are a lot more reads than writes, and the reads or writes of the same or nearby data occur close together in time. An efficient cache has a low reference rate (doesn’t make unnecessary lookups), a very short cache hit time, a high hit ratio, the minimum possible cache miss time, and an efficient way of handling writes and purges.

Some trade-offs can be made. A small, fast cache may have a lower hit ratio but can compensate by having a low miss cost. A big, slower cache can compensate for the extra miss cost by missing much less often.

Beware of looking at hit ratios alone. A system may be running more efficiently by referencing the cache less often. If there are multiple levels of cache and the first level is working well, the second level will tend to see a higher proportion of misses but a low reference rate. It is actually the miss rate (in misses/second) multiplied by the miss cost (to give a total time delay) that matters more than the ratio of hits or misses as a percentage of the total references.
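
A worked example with invented numbers makes the point. Suppose a small cache misses 2,000 times per second at 0.2 ms per miss, and a bigger cache misses only 400 times per second but at 2 ms per miss; the bigger cache has the better hit ratio yet stalls the system for twice as long.

#include <stdio.h>

int main(void)
{
    /* Invented figures for illustration: total stall time is
       miss rate (misses/second) multiplied by miss cost (seconds). */
    double small_miss_rate = 2000.0, small_miss_cost = 0.0002;   /* 0.2 ms */
    double big_miss_rate   =  400.0, big_miss_cost   = 0.0020;   /* 2 ms */

    printf("small cache: %.2f seconds stalled per second\n",
           small_miss_rate * small_miss_cost);                   /* 0.40 */
    printf("big cache:   %.2f seconds stalled per second\n",
           big_miss_rate * big_miss_cost);                       /* 0.80 */
    return 0;
}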

Generic Cache Behavior Summary

I’ve tried to illustrate some of the generic behavior of caches with examples that show what is similar and what is different. Computer systems are constructed from a large number of hardware and software caches of many types, and there can be huge differences in performance between a workload that works with a cache and one that works against a cache. Usually, all we can tune is the cache size for software caches like the DNLC. This tuning has some effect, but it is much more effective to change your workload (don’t run the find command any more often than you have to!) to work within the cache you have. Your computer is spending more time stalled, waiting for some kind of cache miss to complete, than you think.

File Access Caching with Local Disk

Accessing a file on disk or over a network is hundreds of times slower than reading a cached copy from memory. Many types of cache exist to speed up file accesses. Changing your workload to make it more “cache friendly” can result in very significant performance benefits.

We’ll start by looking at the simplest configuration, the open, fstat, read, write, and mmap operations on a local disk with the default UFS file system, as shown in Figure 12-2.

Figure 12-2. File Access Caches for Local Disk Access


There are a lot of interrelated caches. They are systemwide caches, shared by all users and all processes. The activity of one cache-busting process can mess up the caching of other well-behaved processes. Conversely, a group of cache-friendly processes working on similar data at similar times help each other by prefilling the caches for each other. The diagram in Figure 12-2 shows the main data flows and relationships.

Directory Name Lookup Cache

The directory name lookup cache (DNLC) is a cache of directory information. A directory is a special kind of file that contains name and inode number pairs. The DNLC holds the name and a pointer to an inode cache entry. If an inode cache entry is discarded, any corresponding DNLC entries must also be purged. When a file is opened, the DNLC figures out the right inode from the file name given. If the name is in the cache, the system does a fast hashed lookup, and directories need not be scanned.

The UFS directory file structure is a sequence of variable-length entries requiring a linear search. Each DNLC entry is a fixed size, so there is only space for a pathname component of up to 30 characters. Longer components are not cached (many older systems like SunOS 4 cache only up to 14 characters). Directories that have thousands of entries can take a long time to search, so a good DNLC hit rate is important if files are being opened frequently and very large directories are in use. In practice, file opening is not usually frequent enough to be a serious problem. NFS clients hold a file handle that includes the inode number for each open file, so each NFS operation can avoid the DNLC and go directly to the inode.

The maximum tested size of the DNLC is 34906, corresponding to the maximum allowed maxusers setting of 2048. The largest size the DNLC will reach with no tuning is 17498, on systems with over 1 Gbyte of RAM. The size defaults to (maxusers * 17) + 90, and maxusers is set to just under the number of megabytes of RAM in the system, with a default limit of 1024.
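
The default sizing can be sketched as a small C function (an illustration of the formula above; treating maxusers as a couple less than the Mbytes of RAM is an assumption here, since the exact “just under” offset is not spelled out).

/* Sketch of the default DNLC sizing described above.  The offset used
   for maxusers is assumed; the cap and the formula are as stated. */
int default_ncsize(int ram_mbytes)
{
    int maxusers = ram_mbytes - 2;     /* assumed: just under the Mbytes of RAM */

    if (maxusers > 1024)
        maxusers = 1024;               /* default limit with no tuning */
    return (maxusers * 17) + 90;       /* 17498 on systems with over 1 Gbyte */
}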

I find that people are overeager in tuning ncsize; it really only needs to be increased manually on small-memory (512 Mbytes or less) NFS servers, and even then, any performance increase is unlikely to be measurable unless you are running the SPECsfs NFS stress test benchmark.

Inode Cache

The fstat call returns the inode information about a file. This information includes file size and datestamps, as well as the device and inode numbers that uniquely identify the file. Every concurrently open file corresponds to an active entry in the inode cache, so if a file is kept open, its information is locked in the inode cache and is immediately available. A number (set by the tunable ufs_ninode) of inactive inode entries are also kept. ufs_ninode is set using the same calculation as that for ncsize above, but the total size of the inode cache will be bigger because ufs_ninode limits only the inactive entries. ufs_ninode doesn’t normally need tuning, but if the DNLC is increased, make ncsize and ufs_ninode the same.

Inactive files are files that were opened at some time in the past and that may be opened again in the future. If the number of inactive entries grows too large, entries that have not been used recently are discarded. Stateless NFS clients do not keep the inode active, so the pool of inactive inodes caches the inode data for files that are opened by NFS clients. The inode cache entry also provides the location of every data block on disk and the location of every page of file data in memory. If an inactive inode is discarded, all of its file data in memory is also discarded, and the memory is freed for reuse. The percentage of inode cache entries that had pages when they were freed, causing cached file data to be discarded, is reported by sar -g as %ufs_ipf. My inode cache rule described in “Inode Rule” on page 460 warns when non-zero values of %ufs_ipf are seen.

The inode cache hit rate is often 90 percent or more, meaning that most files are accessed several times in a short period of time. If you run a cache-busting command, like find or ls -R, that looks at many files once only, you will see a much lower DNLC and inode cache hit rate. An inode cache hit is quick because a hashed lookup finds the entry efficiently. An inode cache miss varies because the inode may be found in the UFS metadata buffer cache or because a disk read may be needed to get the right block of inodes into the UFS metadata buffer cache.

UFS Metadata Buffer Cache

This cache is often just referred to as “the buffer cache,” but there has been so much confusion about its use that I like to specifically mention metadata. Historically, Unix systems used a “buffer cache” to cache all disk data, assigning about 10 percent of total memory to this job. This changed in about 1988, when SunOS 4.0 came out with a combined virtual memory and I/O setup. This setup was later included in System V Release 4, and variants of it are used in most recent Unix releases. The buffer cache itself was left intact, but it was bypassed for all data transfers, changing it from a key role to a mostly inconsequential role. The sar -b command still reports on its activity, but I can’t remember the buffer cache itself being a performance bottleneck for many years. As the title says, this cache holds only UFS metadata. This includes disk blocks full of inodes (a disk block is 8 Kbytes, an inode is about 300 bytes), indirect blocks (used as inode extensions to keep track of large files), and cylinder group information (which records the way the disk space is divided up between inodes and data). The buffer cache sizes itself dynamically, hits are quick, and misses involve a disk access.

In-memory Page Cache

When we talk about memory usage and demand on a system, it is actually the behavior of the in-memory page cache that is the issue. This cache contains all data that is held in memory, including the files that make up executable code and normal data files, without making any distinction between them. A large proportion of the total memory in the system is used by this cache because it holds all the pages that make up the current working set of the system as a whole. All page-in and page-out operations occur between this cache and the underlying file systems on disk (or over NFS). Individual pages in the cache may currently be unmapped (e.g., a data file) or may be mapped into the address space of many processes (e.g., the pages that make up the libc.so.1 shared library). Some pages do not correspond to a named file (e.g., the stack space of a process); these anonymous pages have swap space reserved for them so that they can be written to disk if required. The vmstat and sar -pg commands monitor the activity of this cache.

The cache is made up of 4-Kbyte or 8-Kbyte page frames. Each page of data may be located on disk as a file system or swap space data block, or in memory in a page frame. Some page frames are ready for reuse or empty and are kept on the free list (reported as free by vmstat).

A cache hit occurs when a needed page is already in memory; this hit may be recorded as an attach to an existing page or as a reclaim if the page was on the free list. A cache miss occurs when the page needs to be created from scratch (zero fill fault), duplicated (copy on write), or read in from disk (page-in). Apart from the page-in, these operations are all quite quick, and all misses take a page frame from the free list and overwrite it.

Consider a naive file-reading benchmark that opens a small file, then reads it to “see how fast the disk goes.” If the file was recently created, then all of the file may be in memory. Otherwise, the first read-through will load it into memory. Subsequent runs may be fully cached with a 100% hit rate and no page-ins from disk at all. The benchmark ends up measuring memory speed, not disk speed. The best way to make the benchmark measure disk speed is to invalidate the cache entries by unmounting and remounting the file system between each run of the test.
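
Here is roughly what such a naive benchmark looks like in C (a sketch; “testfile” is a placeholder name). On the second run, the reads are likely to be satisfied entirely from the in-memory page cache, so the reported rate says more about memory than about the disk.

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    struct timeval t0, t1;
    long total = 0;
    ssize_t n;
    int fd = open("testfile", O_RDONLY);   /* placeholder file name */

    if (fd < 0)
        return 1;
    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;                        /* cached runs never touch the disk */
    gettimeofday(&t1, NULL);
    close(fd);
    printf("%ld bytes in %ld us\n", total,
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    return 0;
}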

The complexities of the entire virtual memory system and paging algorithm are beyond the scope of this book. The key thing to understand is that data is evicted from the cache only if the free memory list gets too small. The data that is evicted is any page that has not been referenced recently, where recently can mean anything from a few seconds to a few minutes. Page-out operations occur whenever files are written and also when data is reclaimed for the free list because of a memory shortage. Page-outs occur to all file systems but are often concentrated on the swap space.

Disk Array Write Cache

Disk array units, such as Sun’s SPARCstorage Array, or “Hardware RAID” subsystems, such as the RSM2000, contain their own cache RAM. This cache is so small in comparison to the amount of disk space in the array that it is not very useful as a read cache. If there is a lot of data to read and reread, it would be better to add large amounts of RAM to the main system than to add it to the disk subsystem. The in-memory page cache is a faster and more useful place to cache data. A common setup is to make reads bypass the disk array cache and to save all the space to speed up writes. If there is a lot of idle time and memory in the array, then the array controller may also look for sequential read patterns and prefetch some read data, but in a busy array, this practice can get in the way. The OS generates its own prefetch reads in any case, further reducing the need to do additional prefetch in the controller. Three main situations, described below, are helped by the write cache.

When a lot of data is being written to a single file, it is often sent to the disk array in small blocks, perhaps 2 Kbytes to 8 Kbytes in size. The array can use its cache to coalesce adjacent blocks, which means that the disk gets fewer, larger writes to handle. The reduction in the number of seeks greatly increases performance and cuts service times dramatically. This operation is safe only if the cache has battery backup (nonvolatile RAM) because the operating system assumes that when a write completes, the data is safely on the disk. As an example, 2-Kbyte raw writes during a database load can go two to three times faster.

The simple Unix write operation is buffered by the in-memory page cache until the file is closed or the data gets flushed out after 30 seconds. Some applications use synchronous writes to ensure that their data is safely on disk. Directory changes are also made synchronously. These synchronous writes are intercepted by the disk array write cache and safely stored in nonvolatile RAM. Since the application is waiting for the write to complete, this approach has a dramatic effect, often reducing the wait from as much as 20 ms to as little as 2 ms. For the SPARCstorage Array, use the ssaadm command to check that fast writes have been enabled on each controller and to see if they have been enabled for all writes or just synchronous writes. Fast writes default to off, so if someone has forgotten to enable them, you could get a good speedup! Use ssaadm to check the SSA firmware revision and upgrade it first. Use the copy in /usr/lib/firmware/ssa, or get a later version from the patch database.

The final use for a disk array write cache is to accelerate the RAID5 write operations in Hardware RAID systems. This usage does not apply to the SPARCstorage Array, which uses a slower, software-based RAID5 calculation in the host system. RAID5 combines disks, using parity for protection, but during writes, the calculation of parity means that all the blocks in a stripe are needed. With a 128-Kbyte interlace and a 6-way RAID5, each full stripe cache entry would use 768 Kbytes. Each individual small write is then combined into the full stripe before the full stripe is written back later on. This method needs a much larger cache than that needed for performing RAID5 calculations at the per-write level, but it is faster because the disks see fewer, larger reads and writes. The SPARCstorage Array is very competitive for use in striped, mirrored, and read-mostly RAID5 configurations, but its RAID5 write performance is slow because each element of the RAID5 data is read into main memory for the parity calculation and then written back. With only 4 Mbytes or 16 Mbytes of cache, the SPARCstorage Array doesn’t have space to do hardware RAID5, although this is plenty of cache for normal use. Hardware RAID5 units have 64 Mbytes or more (sometimes much more); see “Disk Workloads” on page 169 for more on this topic.

The Standard I/O Buffer

Simple text filters in Unix process data one character at a time, using the putchar and getchar macros, printf, and the related stdio.h routines. To avoid a system call for every read or write of one character, stdio uses a buffer to cache the data for each file. The buffer size is 1 Kbyte, so every 1024 calls of getchar, a read system call of 1 Kbyte will occur. Every 8 system calls, a filesystem block will be paged in from disk. If your application is reading and writing data in blocks of 1 Kbyte or more, there is no point in using the stdio library; you can save time by using the open/read/write calls instead of fopen/fread/fwrite. Conversely, if you are using open/read/write for a few bytes at a time, you are generating a lot of unnecessary system calls, and stdio would be faster.
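
The difference is easy to see in a pair of copy loops (a sketch; both copy standard input to standard output). The stdio version issues roughly one read system call per Kbyte of getchar calls, while the raw version issues one system call per 8-Kbyte block, which is the better match for bulk data.

#include <stdio.h>
#include <unistd.h>

/* Character-at-a-time filtering: let stdio batch the system calls. */
void copy_stdio(void)
{
    int c;

    while ((c = getchar()) != EOF)
        putchar(c);
}

/* Block-at-a-time copying: skip stdio and read 8 Kbytes per system call. */
void copy_raw(void)
{
    char buf[8192];
    ssize_t n;

    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, (size_t) n);
}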

Read, Write, and Memory Mapping

When you read data, you must first allocate a buffer, then read into that buffer. The data is copied out of a page in the in-memory page cache to your buffer, so there are two copies of the data in memory. This duplication wastes memory and wastes the time it takes to do the copy. The alternative is to use mmap to map the page directly into your address space. Data accesses then occur directly to the page in the in-memory page cache, with no copying and no wasted space. The drawback is that mmap changes the address space of the process, which is a complex data structure. With a lot of memory mapped files, the data structure gets even more complex. The mmap call itself is more complex than a read or write, and a complex address space also slows down the fork operation. My recommendation is to use read and write for short-lived or small files. Use mmap for random access to large, long-lived files where the avoidance of copying and reduction in read/write/lseek system calls offsets the initial mmap overhead.
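
As a sketch of the mmap approach (illustration only; error handling is minimal), the function below sums the bytes of a file by mapping it, so the pages in the in-memory page cache are accessed in place rather than being copied into a private buffer as read would do.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long sum_file(const char *path)
{
    struct stat st;
    long sum = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char) p[i];       /* touches page cache pages directly */
    munmap(p, (size_t) st.st_size);
    close(fd);
    return sum;
}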

Networked File Access

Paralleling the discussion of local disk accesses, this section looks at networked access. Accessing a file over a network is hundreds of times slower than reading a cached copy from memory. Many types of cache exist to speed up file accesses. Changing your workload to make it more “cache friendly” can result in very significant performance benefits.

NFS Access Caching

Again, we’ll start by looking at the simplest configuration, the open, fstat, read, write, and mmap operations on an NFS-mounted file system, as shown in Figure 12-3.

Figure 12-3. File Access Caches for NFS and Cachefs


Compared with the previous diagram of UFS access caching, the diagram in Figure 12-3 has been split in the middle, pulled to each side, and divided between the two systems. Both systems contain an in-memory page cache, but the NFS file system uses an rnode (remote node) to hold information about the file. Like the UFS inode cache, the rnode keeps pointers to pages from the file that are in the local page cache. Unlike the UFS inode, it does not contain the disk block numbers; instead, it holds the NFS file handle. The handle has no meaning on the client, but on the server it is used to locate the mount point and inode number of the file so that NFS reads and writes can go directly to the right file on the server. An NFS file open on the client causes a DNLC lookup on the client; failure to find the file causes a DNLC lookup on the server that sets up both the DNLC entry and the rnode entry on the client, as shown in Figure 12-3.

There are a lot of interrelated caches. They are systemwide caches, shared by all users and all processes. The activity of one cache-busting process can mess up the caching of other well-behaved processes. Conversely, a group of cache-friendly processes working on similar data at similar times help each other by prefilling the caches for each other. The diagram in Figure 12-3 shows the main data flows and relationships.

Rnode Cache

The lookup NFS call returns the rnode information about a file. This information includes the file size and datestamps, as well as the NFS file handle that encodes server mount point and inode numbers that uniquely identify the file. Every concurrently open file corresponds to an active entry in the rnode cache, so if a file is kept open, its information is locked in the rnode cache and is immediately available. A number (set by the tunable nrnode) of rnode entries are kept. nrnode is set to twice the value of ncsize. It doesn’t normally need tuning, but if the DNLC is increased, nrnode increases as well. DNLC entries are filesystem independent; they refer to entries in the UFS inode cache as well as to those in the NFS rnode cache.

Several NFS mount options affect the operation of NFS data and attribute caching. If you mount mail spool directories from an NFS server, you may have seen a warning message that advises you to use the noac option. This option turns off attribute and write caching. Why would this be a good idea? The access pattern for mail files is that sendmail on the server appends messages to the file. The mailtool on an NFS client checks the file attributes to see if new mail has been delivered. If the attributes have been cached, mailtool will not see the change until the attribute time-out expires. The access pattern of sendmail and mailtool involves multiple processes writing to the same file, so it is not a good idea to cache written data on the client. This warning highlights two issues. One issue is that with multiple clients, there may be multiple copies of the same data cached on the clients and the server, and NFS does not enforce cache coherency among the clients. The second is that in situations where the cache is of no benefit, it can be disabled. See the mount_nfs(1M) manual page for full details on the attribute cache time-out options.

In-memory Page Cache on NFS Client

When we talk about memory usage and demand on a system, it is actually the behavior of this cache that is the issue. It contains all data that is held in memory, including the files that make up executable code and normal data files, without making any distinction between them. A large proportion of the total memory in the system is used by this cache because it holds all the pages that make up the current working set of the system as a whole. All page-in and page-out operations occur between this cache and the underlying file systems on disk and over NFS. Individual pages in the cache may currently be unmapped (e.g., a data file) or may be mapped into the address space of many processes (e.g., the pages that make up the libc.so.1 shared library). Some pages do not correspond to a named file (e.g., the stack space of a process); these anonymous pages have swap space reserved for them so that they can be written to disk if required. The vmstat and sar -pg commands monitor the activity of this cache.

The cache is made up of 4-Kbyte or 8-Kbyte page frames. Each page of data may be located in a file system (local or NFS) or swap space data block, or in memory in a page frame. Some page frames are ready for reuse or empty and are kept on the free list (reported as free by vmstat).

A cache hit occurs when a needed page is already in memory; this hit may be recorded as an attach to an existing page or as a reclaim if the page was on the free list. A cache miss occurs when the page needs to be created from scratch (zero fill fault), duplicated (copy on write), or read in from disk or over the network (page-in). Apart from the page-in, these operations are all quite quick, and all misses take a page frame from the free list and overwrite it.

Page-out operations occur whenever data is reclaimed for the free list because of a memory shortage. Page-outs occur to all file systems, including NFS, but are often concentrated on the swap space (which may itself be an NFS-mounted file on diskless systems).

Disk Array Write Cache or Prestoserve

When the client system decides to do an NFS write, it wants to be sure that the data is safely written before continuing. With NFS V2, each 8-Kbyte NFS write is performed synchronously. On the server, the NFS write may involve several disk writes to update the inode and indirect blocks before the write is acknowledged to the client. Files that are over a few Mbytes in size will have several indirect blocks randomly spread over the disk. When the next NFS write arrives, it may make the server rewrite the same indirect blocks. The effect of this process is that writing a large sequential file over NFS V2 causes perhaps three times as many writes on the server, and those writes are randomly distributed, not sequential. The network is idle while the writes are happening. Thus, an NFS V2 write to a single disk will often show a throughput of 100 Kbytes/s or less over the network and 300 Kbytes/s of writes at the disk (about 40 random 8-Kbyte writes). And although the network is only 10% busy, the disk may be 80% busy or more, and the data rate being sustained is very poor.

There are several possible fixes for this situation. Increasing the amount of data written per NFS write increases throughput, but 8 Kbytes is the practical maximum for the UDP-based transport used by older NFS2 implementations. Providing nonvolatile memory on the server with a Prestoserve or a SPARCstorage Array greatly speeds up the responses and also coalesces the rewrites of the indirect blocks. The amount of data written to disk is no longer three times that sent over NFS, and the written data can be coalesced into large sequential writes that allow several Mbytes/s to be sustained over a 100-Mbit network.

NFS V3 and TCP/IP Improvements

In Solaris 2.5, two new NFS features were introduced. The NFS version 3 protocol uses a two-phase commit to safely avoid the bottleneck of synchronous writes at the server. Basically, the client has to buffer the data being sent for a longer time, in case the commit fails. However, the amount of outstanding write data at the server is increased to make the protocol more efficient.

As a separate feature, the transport used by NFS V2 and V3 can now be TCP as well as UDP. TCP handles segment retransmissions for NFS, so there is no need to resend the whole NFS operation if a single packet is lost. This approach allows the write to be safely increased from 8 Kbytes (6 Ethernet packets) to 32 Kbytes (20 Ethernet packets) and makes operation over lossy wide-area networks practical. The larger NFS reads and writes reduce both protocol and disk seek overhead to give much higher throughput for sequential file accesses. The mount protocol defaults to both NFS V3 and TCP/IP with 32-Kbyte blocks if they are supported by both the client and the server, although NFS V3 over UDP with 32-Kbyte blocks is a little faster on a “clean” local network where retransmissions are unlikely.

The use of larger blocks with NFS V3 assumes that there is good spatial locality. When the workload consists of small random reads, the extra size of each transfer slows down the read rate. It may be helpful to use a mount option to reduce the block size to 8 Kbytes and tune the server to allow a larger multiblock read-ahead.

The Cache File System

Cachefs was introduced in Solaris 2.3 and is normally used to reduce network traffic by caching copies of NFS-mounted data to local disk. Once a cache directory has been set up with cfsadmin(1M), file data is cached in chunks of 64 Kbytes for files of up to 3 Mbytes total size (by default). Bigger files are left uncached, although it may be worth checking the sizes of commonly used commands in your environment to make sure that large cache-mounted executables (e.g., Netscape Navigator) are not being excluded.

When new data is read, extra work must be done. The read is rounded up to a 64-Kbyte chunk and issued to the NFS server. When the data returns, it is written to the cachefs store on disk. If the data is not subsequently reread several times, a lot of extra work has gone to waste.

When data is written, any data in the cache is invalidated by default. A subsequent read is needed to reload the cachefs entry on disk. This requirement is not too bad: a copy of the data is being written in RAM in the page cache, so it is likely that subsequent reads will be satisfied from RAM in the short term.

Another issue is the relative speed and utilization of the local disk compared to the NFS server. A fast NFS server, over a lightly loaded 100-Mbit network, with striped NVRAM-accelerated disks, may make cache misses faster than cache hits to a slow, busy, local disk! If the disk is also handling paging to and from a swap partition, the activity from paging and the cache may often be synchronized at the time a new application starts up. The best use for cachefs is with a busy, low-bandwidth network, where a reduction in network load can dramatically improve client performance for all the users on the network.

The best file systems to cache are read-only or read-mostly. You can check the cache hit rate with the cachefsstat(1M) command that was included in Solaris 2.5. If you are running Solaris 2.3 or 2.4, you can’t determine the hit rate, but I was getting between 97% and 99% hit rate on a read-only, application-distribution NFS mount. A development of cachefs is the Solstice Autoclient application (see http://www.sun.com/Solstice). This application lets you run systems with local disks as cache-only clients. The local disks run the entire operating system installation from cache, avoiding the high network loads and low performance associated with diskless systems while providing highly automated operation with low administration costs.