[Cockcroft98] Chapter 8. Disks

The art of tuning modern computer systems is becoming more and more dependent on disk I/O tuning. This chapter shows how to measure and interpret the disk utilization figures and suggests ways of improving I/O throughput. The chapter describes many different types of disks and controllers found on Sun systems and also talks about tuning file systems and combining disks into stripes and mirrors.

Disk Workloads

There are six different basic access patterns. Read, write, and update operations can either be sequential or randomly distributed. Sequential read and write occur when files are copied and created or when large amounts of data are being processed. Random read and write can occur in indexed database reads or can be due to page-in or page-out to a file. Update consists of a read-modify-write sequence and can be caused by a database system committing a sequence of transactions in either a sequential or random pattern. When you are working to understand or improve the performance of your disk subsystem, spend some time working out which of these categories you expect to be most important.

You cannot automatically tell which processes are causing disk activity; the kernel does not collect this information. You may be able to work out where the workload comes from by looking at how an application was installed, but often you must resort to using truss on a process or the TNF tracing system. See “Tracing Applications” on page 155. The use of TNF to trace disk activity is covered in “The Solaris 2.5 Trace Capability” on page 188. I would like to see a way of getting I/O activity on a per-file-descriptor basis added to Solaris, but until that happens, application-specific instrumentation is all you have. Databases such as Oracle can collect and report data on a per-tablespace basis, so if you can map the tablespaces to physical disks, you can tell what is going on programmatically. This kind of collection and analysis is performed by the BGS Best/1 performance modeling tool, so that changes in disk workload can be modeled.

Sequential versus Random Access

Some people are surprised when they read that a disk is capable of several megabytes per second but they see a disk at 100 percent capacity providing only a few hundred kilobytes per second for their application. Most disks used on NFS or database servers spend their time serving the needs of many users, and the access patterns are essentially random. The time taken to service a disk access is taken up by seeking to the correct cylinder and waiting for the disk to go around. In sequential access, the disk can be read at full speed for a complete cylinder, but in random access, the average seek time quoted for the disk and the average rotational latency should be allowed for between each disk access. The random data rate is thus very dependent on how much data is read on each random access. For file systems, 8 Kbytes is a common block size, but for databases on raw disk partitions, 2 Kbytes is a common block size. Sybase always issues 2-Kbyte reads, but Oracle tries to cluster 2-Kbyte accesses together whenever possible to get a larger transfer size.

For example, a 5400 rpm disk takes 11 ms for a random seek, waits on average 5.6 ms for the disk to rotate to the right sector, and takes about 0.5 ms for a 2-Kbyte transfer and 2 ms for an 8-Kbyte transfer (ignoring other SCSI bus transfers and software overhead). If the disk has a sequential data rate of 4168 Kbytes/s, then the data rate in Kbytes/s for a random seek and a single transfer is the transfer size divided by the sum of the seek time, the rotational latency, and the transfer time.
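A small sketch of that calculation, using the figures quoted above (the helper name and output format are mine, not from the original):

SEEK_MS = 11.0          # average random seek quoted for a 5400 rpm disk
ROTATE_MS = 5.6         # average rotational latency (half a revolution at 5400 rpm)
SEQ_RATE_KBS = 4168.0   # sequential data rate in Kbytes/s

def random_io(size_kb):
    """Return (service time in ms, data rate in Kbytes/s) for one random access."""
    service_ms = SEEK_MS + ROTATE_MS + size_kb / SEQ_RATE_KBS * 1000.0
    return service_ms, size_kb / (service_ms / 1000.0)

for kb in (2, 8, 64, 128, 1024):
    ms, rate = random_io(kb)
    print(f"{kb:5d} Kbytes: {ms:6.1f} ms service time, {rate:5.0f} Kbytes/s")

For 2 Kbytes this gives about 17 ms and 117 Kbytes/s; for 1024 Kbytes, about 262 ms and 3.9 Mbytes/s.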

The service time for each access is the denominator, and in milliseconds you can see that it increases from 17 ms for a 2-Kbyte I/O to 262 ms for a 1024-Kbyte I/O. The optimum size appears to be around 64 to 128 Kbytes because the throughput is high, but the service time is still quite reasonable.

These calculations do not include time spent processing the disk request and filesystem code in the operating system (under a millisecond); the time spent waiting in a queue for other accesses to be processed and sent over the SCSI bus; and the time spent inside the disk’s on-board controller waiting for other accesses to be processed.

Anything that can be done to turn random access into sequential access or to increase the transfer size will have a significant effect on the performance of a system. This is one of the most profitable areas for performance tuning. You can greatly increase the throughput of your disk subsystem by increasing I/O size and reducing the number of seeks.

Disk Reads and Read Caching

A disk read is normally synchronous. The application stops running until the read is complete. In some cases, only a single thread blocks or an asynchronous read (aioread(3)) call is used, and a read may be caused by a prefetch for information that has not been requested yet, but these are not the normal cases. The perceived performance of a system is often critically dependent on how quickly reads complete. When the system reads data for the first time, the system should cache that data in case it is needed again. This caching should use the shared memory area for a database using raw disk or should use the file system cache in main memory. That way, if the data is needed again, no read need occur. It is pointless to put large amounts of memory inside a hardware RAID controller and try to cache reads with it; the first read will miss the cache, and subsequent references to the data in main memory will never involve the disk subsystem. Data that is needed quickly but infrequently will be purged from all the caches between accesses. An application can avoid this purge by telling the database to keep the data resident or telling the kernel to lock the file in memory, but there is no way for an application to tell a disk subsystem what data should be cached for reads. The size of the cache for reads should be limited to what is needed to prefetch data for sequential accesses. Modern SCSI disks contain 512 Kbytes to 2 Mbytes of RAM each and perform read prefetch automatically.

Disk subsystems from vendors such as EMC and Encore (now a Sun product) with gigabytes of NVRAM have been developed in the mainframe marketplace, where memory prices are high and memory capacities are (relatively) low. Disk subsystems have also been sold with Unix systems that have had limited memory capacity, such as IBM AIX and HP’s HP-UX. Until late 1997, AIX and HP-UX were limited to less than 4 Gbytes of total main memory. This problem does not apply to SPARC systems running Solaris 2. Sun’s largest server systems shipped with 5 Gbytes of RAM in 1993, 30 Gbytes in 1996, and 64 Gbytes in 1997, and these systems use it all automatically as a file cache. Sun also has comparable or lower costs per gigabyte of RAM compared to disk subsystem caches. With the Solaris 2.6 32-bit address space limitation, multiple 4-Gbyte processes can coexist with unlimited amounts of cached file data. Database shared memory is limited in Solaris 2.6 to 3.75 Gbytes, but with a 64-bit address space version of Solaris due in 1998, much more can be configured.

If you are trying to decide whether to put an extra few gigabytes of RAM in main memory or into a big disk subsystem controller, I would always recommend that you put it into main memory. You will be able to read the RAM with submicrosecond latency and far higher throughput. You will reduce the total number of reads issued, saving on CPU time, interrupt rates, and I/O bus loadings. And it will probably cost less.

The best disk subsystem for reads is therefore the one with the lowest latency for initial reads of uncached data and highest throughput for sequential transfers. With care taken to cache data effectively in a large main memory, the directly connected Sun A5000 FC-AL disk subsystem meets these criteria.

Disk Writes and the UFS Write Throttle

Disk writes are a mixture of synchronous and asynchronous operations. Asynchronous writes occur during writes to a local file system. Data is written to memory and is flushed out to the file 30 seconds later by the fsflush daemon. Applications can ask for a file to be flushed by calling fsync(3C); closing the file causes a synchronous flush of the file. When a process exits, all of its open files are closed and flushed; this procedure can take a while and may cause a lot of disk activity if many processes exit at the same time.

There is also a limit to the amount of unflushed data that can be written to a file. This limit is implemented by the UFS write throttle algorithm, which tries to prevent too much memory from being consumed by pending write data. For each file, between 256 Kbytes and 384 Kbytes of data can be pending. When less than 256 Kbytes (the low-water mark, ufs_LW) is pending, it is left to fsflush to write the data. When between 256 Kbytes and 384 Kbytes (the high-water mark, ufs_HW) is pending, writes are scheduled to flush the data to disk. If more than 384 Kbytes are pending, then when the process attempts to write more, it is suspended until the amount of pending data drops below the low-water mark. So, at high data rates, writes change from being asynchronous to synchronous, and this change slows down applications. The limitation is per-process, per-file.
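The following is a minimal conceptual sketch of that throttle logic, not the actual kernel code; throttle_action is a name of my own, and the constants carry the default values described above. It is included only to make the three regimes explicit.

UFS_LW = 256 * 1024   # low-water mark: 256 Kbytes of pending data per file (default)
UFS_HW = 384 * 1024   # high-water mark: 384 Kbytes of pending data per file (default)

def throttle_action(pending_bytes):
    """Conceptual model of the per-file UFS write throttle described above."""
    if pending_bytes < UFS_LW:
        return "leave the flushing to fsflush"             # stays fully asynchronous
    if pending_bytes <= UFS_HW:
        return "schedule an asynchronous flush to disk"    # start draining the pending data
    return "suspend the writer until pending < ufs_LW"     # writes have become synchronous

for kb in (100, 300, 500):
    print(f"{kb:3d} Kbytes pending: {throttle_action(kb * 1024)}")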

When you want to write quickly to a file and the underlying disk subsystem can cope with the data rate, you may find that it is impossible to drive the disk to maximum throughput because the write throttle kicks in too early. An application that performs large writes at well-spaced intervals will also be affected, and if the writes are too big for the write throttle, the process will be suspended while the data is written. The high- and low-water marks are global, tunable values that apply to the whole system, but, if necessary, they can be increased. To experiment with different levels without rebooting between trials, you can change the values online with immediate effect by using adb, but take care to increase the high-water mark first, so that it is always greater than the low-water mark. Then, add modified values to /etc/system so that they take effect after a reboot; for example, to increase them by a factor of four, use the entries shown in Figure 8-1. A side effect of this change is that a larger number of I/O requests will be queued for a busy file. This technique can increase throughput, particularly when a wide disk stripe is being used, but it also increases the average I/O response time because the queue length increases. Don’t increase the thresholds too far or disable the write throttle completely, and remember that more RAM will be used by every process that is writing data, so don’t increase the thresholds unless you can spare the extra memory required.

Figure 8-1. Example Tunable Settings for the UFS Write Throttle
set ufs:ufs_HW=1572864 
set ufs:ufs_LW=1048576

A simple way to see the effect of this change is to time the mkfile command—I use the Solaris 2.5 and later ptime(1) command because its precision is higher than that of the usual time(1) command—and watch how long it takes to write a 1-Mbyte file (to a fairly slow disk) with the default write throttle, compared to one where the low-water mark is changed to 1 Mbyte. The effect is that the total CPU time used is the same, but the elapsed time is greatly reduced—from 0.588s to 0.238s.

# /usr/proc/bin/ptime mkfile 1M JUNK ; rm JUNK 

real 0.588
user 0.020
sys 0.155
# adb -kw
physmem 3e39
ufs_HW/W0t1572864
ufs_HW: 0x60000 = 0x180000
ufs_LW/W0t1048576
ufs_LW: 0x40000 = 0x100000
^D
# /usr/proc/bin/ptime mkfile 1M JUNK ; rm JUNK

real 0.238
user 0.020
sys 0.156

Disk Write Caching

A large proportion of disk write activity is synchronous. This synchronicity delays application processes when the writes are to raw disk partitions, are file system metadata such as UFS inodes and directory entries, are to files opened explicitly for synchronous writes, or are incoming NFS write requests, and when the UFS write throttle has cut in. In many cases, the same disk block is rewritten over and over again, as in inode or directory updates. In other cases, many small sequential writes to the same file are performed. Synchronous writes are used because they safely commit the data to disk, ensuring the integrity of log files, databases, and file systems against data loss in the event of a power outage or system crash.

To get good performance, we need a way to safely commit writes to NVRAM, where they can be coalesced into fewer and larger writes and don’t cause applications to wait, rather than doing all the writes to disk. Nonvolatile RAM can be placed in the memory system, on an I/O bus, or in the disk subsystem controller. For simplicity and speed, the best place for the NVRAM is in the main memory system. This is an option on the SPARCstation 20 and SPARCserver 1000/SPARCcenter 2000 generation of systems, using the Prestoserve software and NVSIMM hardware. In the current product range, the Netra NFS 1.2 is a packaged, high-performance NFS server option. It uses a 32-Mbyte NVRAM card that fits in the I/O bus (initially SBus, but PCI eventually) to accelerate a modified version of Solstice DiskSuite. The problem with these solutions is that if the system goes down, the cached data is held inside the system. With the move to large, clustered, high-availability solutions, it is important to provide shared access to the NVRAM from multiple systems, so it must be placed in a multiply connected disk subsystem. This placement incurs the penalty of a SCSI operation for each write, but since the write is stored in NVRAM and immediately acknowledged, this solution is still around ten times faster than a write to disk.

The biggest benefit of disk array controllers comes from enabling fast writes and coalescing writes to reduce the number of physical disk writes. The amount of NVRAM required to perform write caching effectively is quite small. If sufficient throughput is available to drain the NVRAM to the underlying disks, a few megabytes are enough. The first-generation SPARCstorage Array only has 4 Mbytes of NVRAM but provides good fast-write performance. The second generation with 16 Mbytes has plenty. The Sun RSM2000 hardware RAID subsystem needs 64 to 128 Mbytes so that it can cache complete stripes to optimize RAID5 writes effectively. There is no need for more memory than this to act as a write cache. There is also no point in keeping the written data in the array controller to act as a read cache because a copy of the data is likely to still be in the main memory system, which is a far more effective read cache.

If you do not have any NVRAM in your configuration, a reasonable substitute is a completely dedicated log disk, using Solstice DiskSuite’s metatrans option with UFS, or the Veritas VxFS file system. A dedicated log disk with a log file that spans only a few cylinders (100 Mbytes at most) will always have very short seeks, and you can lower write times to very few milliseconds, even for mirrored log disks, as long as you resist the temptation to put something on the unused parts of the log disks.

The best disk subsystem for write-intensive workloads is the RSM2000 with its hardware RAID controller. An array controller with NVRAM is promised as a future option for the A5000 FC-AL array; until then, you will need to use log-based file systems with the A5000, so make sure you have a couple of disks dedicated to making a small mirrored log—with no other activity at all on those disks. The best option is to configure a mixture of the two disk subsystems, so that you can have the fastest reads and the fastest writes as appropriate.

Physical Disk Combinations

It is easy to overload a single disk with too many accesses. When some disks have much higher loads than others, this situation is known as high I/O skew. The solution is to spread the accesses over several disks. There are many ways to do this.

  • Functionally split the usage of the disk, moving some kinds of data or tablespaces to another disk. This approach works well only if the workload is very stable and repeatable. It is often used in benchmarks but is not useful in the real world.

  • Split usage by users, perhaps on a mail server, locating mail files on different disks according to the user name. This approach again assumes a stable average workload.

  • Allocate data space on a round-robin basis over a number of disks as it is requested. Subsequent accesses will spread the load. This is the way that Solaris 2 handles multiple swap space files. The approach is crude but works well enough with application-specific support.

  • Concatenate several disks so that data is spread over all the disks. Accesses to “hot” data tend to overload just one of the concatenated disks. This situation occurs when a file system is grown online by addition of extra space at the end.

  • Interlace disks in a stripe, so that the “hot” data is spread over several disks. The optimal size of the interlace is application dependent. If the interlace size is too large, then the stripe accesses will not be balanced and one disk may become much busier than the others. If the interlace size is too small, then a single small access will need data from more than one disk, and this is slower than a single read. Start at 128 Kbytes, which is often a happy medium; then, if you want to experiment, try both larger and smaller interlace values (see the sketch after this list). A disadvantage of stripes is that they cannot be grown online. Another stripe must be concatenated on the end. Also, if a disk fails, the whole stripe fails.

  • Stripe with mirrored protection so that all the data is stored in two places. This approach has a small additional performance impact on writes: two writes must be issued and both must complete, so the slowest disk determines the performance. This is known as RAID 1+0, or RAID 10. It is the fastest safe storage configuration.

  • Stripe with parity protection. Parity distributed over all the disks is known as RAID5. It uses fewer disks but is either far slower than mirroring or uses an expensive hardware RAID5 controller that may cost more than the extra disks that mirroring would require!
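As a rough illustration of how the interlace size determines which disks a single access touches, the hypothetical helper below (not from the original text) maps a byte range onto the members of a stripe. With a 128-Kbyte interlace, a 2-Kbyte random read lands on one disk, while a 1-Mbyte sequential read spreads across all of them.

def disks_touched(offset, length, interlace=128 * 1024, ndisks=4):
    """Return the set of stripe members that a [offset, offset+length) access touches."""
    first = offset // interlace
    last = (offset + length - 1) // interlace
    return {unit % ndisks for unit in range(first, last + 1)}

print(disks_touched(offset=5 * 1024 * 1024, length=2 * 1024))   # small random read: one disk
print(disks_touched(offset=0, length=1024 * 1024))              # 1-Mbyte sequential read: all four disks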

If the controller is busy talking to one disk, then another disk has to wait; the resulting contention increases the latency for all the disks on that controller. As the data rate of one disk approaches the bus bandwidth, the number of disks you should configure on each bus is reduced. Fiber channel transfers data independently in both directions at the same time and so handles higher loads with less contention.

For sequential accesses, there is an obvious limit on the aggregate bandwidth that can be sustained by a number of disks. For random accesses, the bandwidth of the SCSI bus is much greater than the aggregate bandwidth of the transfers, but queuing effects increase the latency of each access as more disks are added to each SCSI bus. If you have too many disks on the bus, you will tend to see high wait values in iostat -x output. The commands queue in the device driver as they wait to be sent over the SCSI bus to the disk.

RAID and Disk Arrays

Disks can be logically combined in several ways. You can use hardware controllers and two different software packages.

The SPARCstorage Array appears to the system as 30 disks on a single controller that are then combined by means of software. Other disk array subsystems combine their disks, using a fixed-configuration hardware controller, so the disk combination appears as fewer very big disks.

The term RAID stands for Redundant Array of Inexpensive Disks, and several levels have been defined to cover the possible configurations. The parity-based RAID configurations are a solution to the problem of low-cost high availability but do not give as good performance as lots of independent mirrored disks. Here is a useful quote:

Fast, cheap, safe; pick any two.

This book is about performance, so if I start with fast, then to be fast and cheap, I just use lots of disks; and to be fast and safe, I use twice as many and mirror them. If performance is not the main concern, then a RAID5 combination is cheap and safe.

Hardware RAID controllers containing large amounts of NVRAM, such as Sun’s RSM2000 product, are in themselves fairly expensive items, but they can be used to create fast and safe combinations. The best performance comes when most operations occur to or from NVRAM. Sustained high throughput usually saturates the controller itself, internal buses, or the connection from the system to the hardware RAID unit. One approach is to increase the amount of NVRAM in the RAID controller. This approach increases the ability to cope with short-term bursts, but diminishing returns soon set in. For the write caching and coalescing required to make RAID5 operations efficient, the 64–128 Mbytes provided by the RSM2000 are plenty. Read caching is best handled elsewhere. If you need to cache a few gigabytes of data for reads, you should add extra memory to the system itself, where it can be accessed at full speed and where the operating system and applications can decide what should be cached. Don’t put the extra memory in the RAID controller where there is no control over what gets cached and where a relatively slow SCSI bus is in the way.

Large disk arrays such as the EMC and the Sun/Encore SP-40 units are configured with many connections to the system. This configuration helps throughput, but the basic latency of a SCSI request is the same and is hundreds of times slower than a main memory request. Smaller arrays like the RSM2000 have two 40-Mbyte/s connections to the server, and many RSM2000 arrays can be connected to a system, increasing throughput and write cache proportionally as disk capacity is added. Much higher performance can be obtained from multiple RSM2000 subsystems than from a single larger system with the same disk capacity.

The EMC Symmetrix has become quite common on big Sun server configurations. The Encore SP40 is a similar product on a feature-by-feature basis but is less well known due to the smaller market presence of Encore. This situation is about to change—Sun has purchased the storage business of Encore and will soon be selling the SP40 in direct competition with EMC. Since benchmarks have shown that a single RSM2000 is capable of higher performance than an EMC (and probably also an SP40), it is clear that the attraction of these disk subsystems is not primarily performance driven. Their feature set is oriented toward high serviceability, automatically dialing out to report a problem, and heterogeneous connectivity, linking simultaneously to mainframes and multiple Unix systems. Where a lot of sharing is going on, there is a case for putting a larger cache in the disk subsystem so that writes from one system can be cached for read by another system. This form of read caching is much slower than host-based caching but does centralize the cache. This class of system typically supports up to 4 Gbytes of cache.

A new naming scheme from Sun refers to the RSM2000 as the A3000, the new FC-AL subsystem as the A5000, and the SP40 as the A7000. All three disk subsystems will eventually have 100 Mbyte/s fiber channel capability and can be connected via fiber channel switches into an “enterprise networked storage subsystem.” Clustered systems that need to share access to the same disk subsystem to run parallel databases can use these switches to have many systems in the cluster.

Disk Configuration Trade-off

Let’s say you’re setting up a large server that replaces several smaller ones so it will handle a bit of everything: NFS, a couple of large databases, home directories, number crunching, intranet home pages. Do you just make one big file system and throw the lot in, or do you need to set up lots of different collections of disks?

Several trade-offs need to be made, and there is no one solution. The main factors to balance are the administrative complexity, resilience to disk failure, and the performance requirements of each workload component.

There are two underlying factors to take into account: the filesystem type and whether the access pattern is primarily random or sequential. I’ve shown these factors as a two-by-two table, with the typical workloads for each combination shown (see Table 8-1). I would combine NFS, home directories, and intranet home pages into the random-access filesystem category. Databases have less overhead and less interaction with other workloads if they use separate raw disks. Databases that don’t use raw disk should be given their own filesystem setup, as should number-crunching applications that read/write large files.

Table 8-1. Workload Characteristics
Workloads      Random              Sequential
Raw disk       Database indexes    Database table scans
File system    Home directories    Number crunching

We need to make a trade-off between performance and complexity. The best solution is to configure a small number of separate disk spaces, each optimized for a type of workload. You can get more performance by increasing the number of separate disk spaces, but if you have fewer, it is easier to share the spare capacity and add more disks in the future without having to go through a major reorganization. Another trade-off is between cost, performance, and availability. You can be fast, cheap, or safe; pick one, balance two, but you can’t have all three. One thing about combining disks into a stripe is that the more disks you have, the more likely it is that one will fail and take out the whole stripe.

If each disk has a mean time between failure (MTBF) of 1,000,000 hours, then 100 disks have a combined MTBF of only 10,000 hours (about 60 weeks). If you have 1,000 disks, you can expect to have a failure every six weeks on average. Also note that the very latest, fastest disks are much more likely to fail than disks that have been well debugged over an extended period, regardless of the MTBF quoted on the specification sheet. Most Sun disks have an MTBF of around 1,000,000 hours. Failure rates in the installed base are monitored and, in most cases, improve on the quoted MTBF. Problems sometimes occur when there is a bad batch of disks. A particular customer may have several disks from the bad batch and experience multiple failures, while many other customers have very few failures. Bad batches and systematic problems are caught during testing, and Sun’s Enterprise Test Center normally contains around 50 terabytes of disk undergoing maximum system configuration tests. For example, prior to the launch of the A5000, which has FC-AL connections all the way to a new kind of disk, tens of terabytes of FC-AL disks were running in the Enterprise Test Center for many months.
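The arithmetic behind those numbers is simply the single-disk MTBF divided by the number of disks, assuming independent failures; a quick sketch:

HOURS_PER_WEEK = 24 * 7

def combined_mtbf_hours(disk_mtbf_hours, ndisks):
    """Expected time between failures across a population of independent disks."""
    return disk_mtbf_hours / ndisks

for n in (100, 1000):
    hours = combined_mtbf_hours(1000000, n)
    print(f"{n:5d} disks: one failure every {hours:,.0f} hours "
          f"(about {hours / HOURS_PER_WEEK:.0f} weeks)")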

There are two consequences of failure: one is loss of data, the other is downtime. In some cases, data can be regenerated (from database indexes, number-crunching output, or backup tapes), so if you can afford the time it takes to restore the data and system failure is unlikely to happen often, there is no need to provide a resilient disk subsystem. In that case, you can configure for the highest performance. If data integrity or high availability is important, there are two common approaches: mirroring and parity (typically RAID5). Mirroring affords the highest performance, especially for write-intensive workloads, but requires twice as many disks for its implementation. Parity uses one extra disk in each stripe to hold the redundant information required to reconstruct data after a failure. Writes require read-modify-write operations, and there is the extra overhead of calculating the parity. Sometimes, the cost of a high-performance RAID5 array controller exceeds the cost of the extra disks you would need to do simple mirroring. To achieve high performance, these controllers use nonvolatile memory to perform write-behind safely and to coalesce adjacent writes into single operations. Implementing RAID5 without nonvolatile memory will give you very poor write performance. The other problem with parity-based arrays is that when a disk has failed, extra work is needed to reconstruct the missing data, and performance is seriously degraded.

Choosing Between Solstice DiskSuite and Veritas VxVM

Small configurations and ones that do not have any NVRAM in the disk subsystem should use Solstice DiskSuite. It is bundled with the server license for Solaris, and it has the metatrans logging filesystem option that substitutes for NVRAM. It is also simpler to use for small configurations.

Large configurations and those that have NVRAM in the disk subsystem should use VxVM. It is bundled with the SPARCstorage Array, and it is easier to use and has fewer limitations for large configurations. To get a logging file system, you must also use the optional VxFS file system, although NVRAM provides a high-performance log for database raw disk accesses. VxFS is better than UFS for very large files, large directories, high throughput, and features like snapshot backup.

It is quite possible to use both volume managers on the same system, with DiskSuite handling logging for UFS file systems, and VxVM handling large raw database tables.

Setting Up a Disk for Home, NFS, and the Web

I assume that the file system dedicated to home directories, NFS, and web server home pages is mostly read intensive and that some kind of array controller is available (a SPARCstorage Array or RSM2000) that has nonvolatile memory configured and enabled. Note that SPARCstorage Arrays default to having NVRAM disabled. You use the ssaadm command to turn on fast writes. The nonvolatile fast writes greatly speed up NFS response times for writes, and, as long as high-throughput applications do not saturate this file system, it is a good candidate for a RAID5 configuration. The extra resilience saves you from data loss and keeps the end users happy, without wasting disk space on mirroring. The default UFS filesystem parameters should be tuned slightly, inasmuch as there is no need to waste 10% on free space and almost as much on inodes. I would configure 1% or 2% free space (default is 10%) and 8-Kbyte average file size per inode (default is 2 Kbytes) unless you are configuring a file system that is under 1 Gbyte in size. The latest Solaris release automatically reduces the free space percentage when large file systems are created.

# newfs -i 8192 -m 1 /dev/raw_big_disk_device
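To see why the inode density matters, here is a rough calculation for a hypothetical 4-Gbyte file system, assuming the traditional 128-byte UFS on-disk inode; the default of one inode per 2 Kbytes reserves several percent of the space for inodes, which is what the comment above about wasting almost as much on inodes refers to.

INODE_BYTES = 128                      # assumed size of a UFS on-disk inode
FS_BYTES = 4 * 1024**3                 # example: a 4-Gbyte file system

for bytes_per_inode in (2048, 8192):   # default -i 2048 versus the suggested -i 8192
    inode_space = FS_BYTES // bytes_per_inode * INODE_BYTES
    print(f"-i {bytes_per_inode}: {inode_space / 1024**2:6.0f} Mbytes of inodes "
          f"({100.0 * inode_space / FS_BYTES:.1f}% of the file system)")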

You create the raw_big_disk_device itself by combining groups of disks, using the RSM2000’s RM6 configuration tool or Veritas VxVM, into RAID5-protected arrays, then concatenating the arrays to make the final file system. To extend the filesystem size in the future, make up a new group of disks into a RAID5 array and extend the file system onto it. You can grow a file system online if necessary, so there is no need to rebuild the whole array and restore it from backup tapes. Each RAID5 array should contain between 5 and 30 disks. I’ve personally used a 25-disk RAID5 setup on a SPARCstorage Array. For highest performance, keep the setup to the lower end of this range and concatenate more, smaller arrays. We have found that a 128-Kbyte interlace is optimal for this largely random-access workload. Solstice DiskSuite (SDS) cannot concatenate or extend RAID5 arrays, so if you are using SDS, make one big RAID5 array and save/restore the data if the array needs to grow.

Another issue to consider is the filesystem check required on reboot. If the system shut down cleanly, fsck can tell that it is safe to skip the check. If the system went down in a power outage or crash, tens of minutes to an hour or more could be required to check a really huge file system. The solution is to use a logging file system, where a separate disk stores all the changes. On reboot, fsck reads the log in just a few seconds, and the checking is done. With SDS, this method is set up by means of a “metatrans” device and the normal UFS. In fact, a preexisting SDS-hosted file system can have the metatrans log added without any disruption to the data. With the Veritas Volume Manager, you must use the Veritas file system, VxFS, because the logging hooks in UFS are SDS specific. For good performance, put the log on a dedicated disk, and, for resilience, mirror the log. In extreme cases, the log disk might saturate and require striping over more than one disk. In very low usage cases, you can situate the log in a small partition at the start of a data disk, but this solution can hurt performance a lot because the disk heads seek between the log and the data.

An example end result is shown in Table 8-2. To extend the capacity, you would make up another array of disks and concatenate it. There is nothing to prevent you making each array a different size, either. Unless the log disk maxes out, a mirrored pair of log disks should not need to be extended.

Table 8-2. Concatenated RAID5 Configuration Example
Log 1    Log 2
Array 1  Array 1  Array 1  Array 1  Array 1  Array 1  Array 1  Array 1  Array 1  Array 1
Array 2  Array 2  Array 2  Array 2  Array 2  Array 2  Array 2  Array 2  Array 2  Array 2
Array 3  Array 3  Array 3  Array 3  Array 3  Array 3  Array 3  Array 3  Array 3  Array 3

Setting Up a Disk for High-Performance Number Crunching

High-throughput applications like number crunching that do large, high-speed, sequential reads and writes should be set up to use a completely separate collection of disks on their own controllers. In most cases, data can be regenerated if there is a disk failure, or occasional snapshots of the data can be compressed and archived into the home directory space. The key thing is to off-load frequent I/O-intensive activity from the home directories into this high-performance “scratch pad” area. Configure as many fast-wide (20 MB/s) or Ultra-SCSI (40 MB/s) disk controllers as you can. Each disk should be able to stream sequential data at between 5 and 10 Mbytes/s, so don’t put too many disks on each bus. Nonvolatile memory in array controllers may help in some cases, but it may also get in the way. The cost may also tempt you to use too few controllers. A large number of SBus SCSI interfaces is a better investment for this particular workload.

If you need to run at sustained rates of more than 20–30 Mbytes/s of sequential activity on a single file, you will run into problems with the default UFS file system. The UFS indirect block structure and data layout strategy work well for general-purpose accesses like the home directories but cause too many random seeks for high-speed, sequential performance. The Veritas VxFS file system is an extent-based structure, which avoids the indirect block problem. It also allows individual files to be designated as “direct” for raw unbuffered access. This designation bypasses the problems caused by UFS trying to cache all files in RAM, which is inappropriate for large sequential access files and stresses the pager. It is possible to get close to the theoretical limit of the I/O buses and backplane of a machine with a carefully set up VxFS configuration and a lot of fast disks.

Sun’s HPC 2.0 software comes with a high-performance, parallel, distributed file system called PFS. The programming interface to PFS is via the Message Passing Interface I/O (MPI-IO) library, or the High Performance Fortran (HPF) language, which allows large data arrays to be distributed and written in parallel over a cluster. PFS uses its own on-disk format that is optimized for this purpose.

A log-based file system may slow down high-speed sequential operations by limiting them to the throughput of the log. It should log only synchronous updates, like directory changes and file creation/deletion, and it is unlikely to be overloaded, so use a log to keep reboot times down.

Setting Up a Disk for Databases

Database workloads are very different again. Reads may be done in small random blocks (when looking up indexes) or large sequential blocks (when doing a full table scan). Writes are normally synchronous for safe commits of new data. On a mixed workload system, running databases through the file system can cause virtual memory “churning” because of the high levels of paging and scanning associated with filesystem I/O. This activity can affect other applications adversely, so where possible, use raw disks or direct unbuffered I/O on a file system that supports it, such as UFS (a new feature for Solaris 2.6) and VxFS.

Both Oracle and Sybase default to a 2-Kbyte block size. A small block size keeps the disk service time low for random lookups of indexes and small amounts of data. When a full table scan occurs, the database may read multiple blocks in one operation, causing larger I/O sizes and sequential patterns.

Databases have two characteristics that are greatly assisted by an array controller that contains nonvolatile RAM. One characteristic is that a large proportion of the writes are synchronous and are on the critical path for user response times. The service time for a 2-Kbyte write is often reduced from about 10–15 ms to 1–2 ms. The other characteristic is that synchronous sequential writes often occur as a stream of small blocks, typically of only 2 Kbytes at a time. The array controller can coalesce multiple adjacent writes into a smaller number of much larger operations, and that group can be written to disk far faster. Throughput can increase by as much as three to four times on a per-disk basis.
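A rough back-of-the-envelope sketch of the coalescing effect, using assumed figures (a 2-Kbyte synchronous write costing about 12 ms at the disk, and the controller merging roughly four adjacent writes into one larger operation); the numbers are illustrative, not measurements from the original text.

WRITE_2K_MS = 12.0       # assumed disk service time for one 2-Kbyte synchronous write
COALESCE_FACTOR = 4      # assumed number of adjacent 2-Kbyte writes merged in NVRAM
MERGED_WRITE_MS = 13.0   # assumed disk service time for the single merged 8-Kbyte write

uncoalesced_kb_per_s = 2.0 / (WRITE_2K_MS / 1000.0)
coalesced_kb_per_s = (2.0 * COALESCE_FACTOR) / (MERGED_WRITE_MS / 1000.0)

print(f"uncoalesced: {uncoalesced_kb_per_s:.0f} Kbytes/s per disk")
print(f"coalesced:   {coalesced_kb_per_s:.0f} Kbytes/s per disk "
      f"({coalesced_kb_per_s / uncoalesced_kb_per_s:.1f}x)")

In addition, because each write is acknowledged from NVRAM, the application sees the 1–2 ms latency quoted above rather than the 10–15 ms it takes to reach the disk.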

Data integrity is important, but some sections of a database can be regenerated after a failure. You can trade off performance against availability by making temporary tablespaces and perhaps indexes out of wide, unprotected stripes of disks. Tables that are largely read-only or that are not on the critical performance path can be assigned to RAID5 stripes. Safe, high-performance writes should be handled by mirrored stripes.

The same basic techniques described in previous sections can be used to configure the arrays. Use concatenations of stripes—unprotected, RAID5, or mirrored—with a 128-Kbyte interlace.

A Last Word

I have scratched the surface of a large and possibly contentious subject here. I hope this information gives you the basis for a solution. The important thing is that I divided the problem into subproblems by separating the workloads according to their performance characteristics. Next, I proposed a solution for each, based upon an appropriate balance of performance, cost, and availability. Finally, I implemented distinct blocks of disks configured to support each workload.

Disk Load Monitoring

If a system is under a heavy I/O load, then the load should be spread across as many disks and disk controllers as possible. To see what the load looks like, use the iostat command or one of the SE toolkit variants.

Output Formats and Options for iostat

The iostat command produces output in many forms. The iostat -x variant provides extended statistics and is easier to read when a large number of disks are being reported, since each disk is summarized on a separate line (see Figure 8-2). The values reported are the number of transfers and kilobytes per second, with read and write shown separately; the average number of commands waiting in the queue; the average number of commands actively being processed by the drive; the I/O service time; and the percentages of the time that commands were waiting in the queue and were active on the drive.

Figure 8-2. iostat -x Output
% iostat -txc 5 
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 77 42 9 9 39
sd0 0.0 3.5 0.0 21.2 0.0 0.1 41.6 0 14
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended disk statistics tty cpu
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b tin tout us sy wt id
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 84 37 17 45 1
sd0 0.0 16.8 0.0 102.4 0.0 0.7 43.1 2 61
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd3 0.0 1.0 0.0 8.0 0.0 0.1 114.3 2 4


Disk configurations have become extremely large and complex on big server systems. The maximum configuration E10000 supports several thousand disk drives, but dealing with even a few hundred is a problem. When large numbers of disks are configured, the overall failure rate also increases. It can be hard to keep an inventory of all the disks, and tools like Solstice Symon depend upon parsing messages from syslog to see if any faults are reported. The size of each disk is also growing. When more than one type of data is stored on a disk, it becomes hard to work out which disk partition is active. A series of new features has been introduced in Solaris 2.6 to help solve these problems.

  • Per-partition data identical to existing per-disk data. It is now possible to separate out root, swap, and home directory activity even if they are all on the same disk.

  • New “error and identity” data per disk, so there is no longer a need to scan syslog for errors. Full data is saved from the first SCSI probe to a disk. This data includes Vendor, Product, Revision, Serial number, RPM, heads, and size. Soft, hard, and transport error counter categories sum up any problems. The detail option adds Media Error, Device not ready, No device, Recoverable, Illegal request, and Predictive failure analysis. Dead or missing disks can still be identified because there is no need to send them another SCSI probe.

  • New iostat options are provided to present these metrics. One option (iostat -M) shows throughput in Mbytes/s rather than Kbytes/s for high-performance systems. Another option (-n) translates disk names into a much more useful form, so you don’t have to deal with the sd43b format—instead, you get c1t2d5s1. This feature makes it much easier to keep track of per-controller load levels in large configurations.

Fast tapes now match the performance impact of disks. We recently ran a tape backup benchmark to see if there were any scalability or throughput limits in Solaris, and we were pleased to find that the only real limit is the speed of the tape drives. The final result was a backup rate of an Oracle database at 1 terabyte/hour. This result works out at about 350 Mbytes/s, which was as fast as the disk subsystem we had configured could go. To sustain this rate, we used every tape drive we could lay our hands on, including 24 StorageTEK Redwood tape transports, which run at around 15 Mbytes/s each. We ran this test using Solaris 2.5.1, but there are no measurements of tape drive throughput in Solaris 2.5.1. Tape metrics have now been added to Solaris 2.6, and you can see which tape drive is active, the throughput, average transfer size, and service time for each tape drive.

Tapes are instrumented the same way as disks; they appear in sar and iostat automatically. Tape read/write operations are instrumented with all the same measures that are used for disks. Rewind and scan/seek are omitted from the service time.

The output format and options of sar(1) are fixed by the generic Unix standard SVID3, but the format and options for iostat can be changed. In Solaris 2.6, existing iostat options are unchanged, and apart from extra entries that appear for tape drives and NFS mount points (described later), anyone storing iostat data from a mixture of Solaris 2 systems will get a consistent format. New options that extend iostat are as follows:

-E   full error statistics
-e   error summary statistics
-n   disk name and NFS mount point translation, extended service time
-M   MB/s instead of KB/s
-P   partitions only
-p   disks and partitions

Here are examples of some of the new iostat formats.

% iostat -xp 
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd106 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd106,a 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd106,b 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd106,c 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
st47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

% iostat -xe 
extended device statistics ---- errors ----
device r/s w/s kr/s kw/s wait actv svc_t %w %b s/w h/w trn tot
sd106 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0
st47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0

% iostat -E 

sd106 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST15230W SUN4.2G Revision: 0626 Serial No:
00193749
RPM: 7200 Heads: 16 Size: 4.29GB <4292075520 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

st47 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXABYTE Product: EXB-8505SMBANSH2 Revision: 0793 Serial No:

New NFS Metrics

Local disk and NFS usage are functionally interchangeable, so Solaris 2.6 was changed to instrument NFS client mount points like disks! NFS mounts are always shown by iostat and sar. With automounted directories coming and going more often than disks coming online, that change may cause problems for performance tools that don’t expect the number of iostat or sar records to change often. We had to do some work on the SE toolkit to handle dynamic configuration changes properly.

The full instrumentation includes the wait queue for commands in the client (biod wait) that have not yet been sent to the server. The active queue measures commands currently in the server. Utilization (%busy) indicates the server mount-point activity level. Note that unlike the case with disks, 100% busy does NOT indicate that the server itself is saturated; it just indicates that the client always has outstanding requests to that server. An NFS server is much more complex than a disk drive and can handle many more simultaneous requests than a single disk drive can.

The example shows off the new -xnP option, although NFS mounts appear in all formats. Note that the P option suppresses disks and shows only disk partitions. The xn option breaks down the response time, svc_t, into wait and active times and puts the device name at the end of the line so that long names don’t mess up the columns. The vold entry automounts floppy and CD-ROM devices.

% iostat -xnP 
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 crun:vold(pid363)
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 servdist:/usr/dist
0.0 0.5 0.0 7.9 0.0 0.0 0.0 20.7 0 1
servhome:/export/home2/adrianc
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 servmail:/var/mail
0.0 1.3 0.0 10.4 0.0 0.2 0.0 128.0 0 2 c0t2d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t2d0s2


Idle Disks and Long Service Times

Disks can behave in puzzling ways. The only way to find out what is really happening is to delve beneath the averages shown by iostat. Solaris 2.5 and later releases include a trace system that lets you see each individual disk access. As an example, I’ll track down an odd behavior in iostat that has been bugging people for years.

I keep seeing disks that are lightly used but have extremely large service times. Disks are supposed to have an average seek time of about 10–20 ms, so why do they often report over 100 ms when they can’t possibly be overloaded? Why does this happen? Is it a sign of a problem?

                                 extended disk statistics 
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
sd2 1.3 0.3 11.7 3.3 0.1 0.1 146.6 0 3
sd3 0.0 0.1 0.1 0.7 0.0 0.0 131.0 0 0

This is one of those recurring questions that everyone seems to ask at one time or another. The short answer is that the apparent overload can safely be ignored because the disks are so lightly used that they don’t make a difference to the performance of the system. This answer is rather unsatisfying because it doesn’t explain why the large service times occur in the first place. Several theories are circulating:

Is it a bug?

This theory can be dismissed. The problem has been seen for many years and has even been reported as a bug and investigated. The calculations in iostat are well tested and correct.

Is it caused by rounding error at low activity levels?

This is what I thought the problem was for many years. It was only when I used I/O tracing to look at near-idle disks that I found out what was really going on. Rounding errors cannot explain the high service times we see.

Is it caused by energy star and thermal recalibration?

Modern disks have a mind of their own. If you stop using them for a while, they power off their circuitry and can even be programmed to spin down completely. Even when they are in use, they go through a recalibration sequence every now and again. This recalibration keeps the heads perfectly aligned even when temperature changes cause thermal expansion. While it’s true that these activities will increase service times, they should be relatively infrequent. We might be able to find this kind of access in an I/O trace. It should appear as an isolated, short-distance seek that takes a long time.

Is it something to do with the file system?

If your system has disks that are used raw by a database such as Sybase or Oracle, then you may notice that service times are much shorter overall. Even when activity levels are low, the service time stays low. This observation seems to rule out recalibration and energy star as the cause. Perhaps the filesystem layout is a problem? Inodes, indirect blocks, and data blocks may be scattered all over the disk. The extra long-distance seeks required to find the inode, get the indirect block, and, finally, read or write data probably explain why filesystem service times are generally higher than raw disk service times. At low usage levels, though, they still don’t explain why we see such long seek times.

Is it the filesystem flush process?

The fsflush process (pid 3) keeps the data on disk in sync with the data in memory at regular intervals. It normally runs every 5 seconds to flush data and does additional work to flush inodes every 30 seconds. This frequency is enough to be a good candidate for the problem. It also explains why the problem is not seen on raw disks. The only way to settle this question is to trace each individual access to see which process initiated the I/O, how long it took, and what else is happening at the same time. Solaris 2.5 introduced a new tracing capability that we can use, so I’ll start by explaining how it works, then use it to capture the I/O trace.

The Solaris 2.5 Trace Capability

Unix systems have had a kernel trace capability for many years. It was designed for development and debugging, not for end users. The production kernel is normally built without the trace capability for performance reasons. One of the first production kernels to include tracing was IBM’s AIX kernel on the RS/6000 range. They left it turned on during early releases to assist in debugging, then decided that tracing was useful enough to pay its way and the overhead was quite low, so it is now a permanent feature. SunSoft also recognized the value of trace information but decided to extend the trace capability to make it more generally useful and to implement it alongside the existing kernel trace system. It was introduced in Solaris 2.5 and consists of the following features.

  • A self-describing trace output format, called Trace Normal Form (TNF), allows data structures to be embedded in the trace file without the need for an external definition of their types and contents.

  • A set of libraries allows user-level programs to generate trace data. In particular, this trace data helps analyze and debug multithreaded applications.

  • A well-defined set of kernel probe points covering most important events was implemented.

  • A program prex(1) controls probe execution for both user and kernel traces.

  • A program tnfxtract(1) reads out the kernel trace buffer, and tnfdump(1) displays TNF data in human-readable ASCII.

  • There are manual pages for all the commands and library calls. The set of implemented kernel probes is documented in tnf_probes(4).

A few things about kernel probes are inconvenient. While user-level probes can write to trace files, the kernel probes write to a ring buffer. This buffer is not a single global ring; it is a buffer per kernel thread. This buffer scheme avoids any need to lock the data structures, so there is no performance loss or contention on multiprocessor systems. You cannot easily tell how big the buffer needs to be, and one highly active probe point may loop right round its buffer while others have hardly started. If you are trying to capture every single probe, make the buffer as big as you can. In general, it is best to work with low event rate probes or rely on sampling and put up with missing probes. The tnfxtract routine just takes a snapshot of the buffer, so a second snapshot will include anything left over from the first one. The tnfdump program does quite a lot of work to sort the probe events into time order.

More TNF information, including a free but unsupported GUI Trace browser, can be found at the http://opcom.sun.ca web site.

I/O Trace: Commands and Features

The command sequence to initiate an I/O trace is quite simple. You run the commands as root, and you need a directory to hold the output. I find it easiest to have two windows open: one to run prex, and the other to go through a cycle of extracting, dumping, and viewing the data as required. The command sequence for prex is to first allocate a buffer (the default is 384 Kbytes; you can make it bigger), enable the io group of probes, make them trace accesses, then turn on the global flag that enables all kernel tracing.

# prex -k 
Type "help" for help ...
prex> buffer alloc
Buffer of size 393216 bytes allocated
prex> enable io
prex> trace io
prex> ktrace on

Now, wait a while or run the program you want to trace. In this case, I ran iostat -x 10 in another window, didn’t try to cause any activity, and waited for some slow service time to appear. After a minute or so, I stopped collecting.

prex> ktrace off

In the other window I extracted the data and dumped it to take a look.

# mkdir /tmp/tnf 
# cd /tmp/tnf
# tnfxtract io.tnf
# tnfdump io.tnf | more

The first section identifies the three probes that were enabled because they have the key “io.” The “strategy” probe records an I/O being issued; “biodone” records an I/O completing. The other mechanism that may cause I/O is the pageout scanner, and this is also being probed. The location of the probe in the Solaris source code is printed for kernel developers to refer to. These probes include the time taken to allocate pages of memory to hold the results of the I/O request. Being short on memory can slightly delay an I/O.

probe    tnf_name: "strategy" tnf_string: "keys io blockio;file 
../../common/os/driver.c;line 358;"
probe tnf_name: "pageout" tnf_string: "keys vm pageio io;file
../../common/vm/vm_pvn.c;line 511;"
probe tnf_name: "biodone" tnf_string: "keys io blockio;file
../../common/os/bio.c;line 935;"

The next section is a table of probe events in time order. When a user process was scheduling the I/O, its PID was recorded. Events caused by interrupts or kernel activity were recorded as PID 0. This trace was measured on a dual CPU SPARCstation 10, so you can see a mixture of CPU 0 and CPU 2. I’ve listed the first few records, then skipped through, looking for something interesting.

Elapsed    Delta      PID      TID             CPU Probe    Data / 
(ms) (ms) LWPID Name Description
0.00000 0.00000 632 1 0xf61a1d80 0 strategy device: 8388632
block: 795344
size: 8192
buf: 0xf5afdb40
flags: 9
108.450 108.450 0 0 0xfbf6aec0 2 biodone device: 8388632
block: 795344
buf: 0xf5afdb40
108.977 0.52700 632 1 0xf61a1d80 0 strategy device: 8388632
block: 795086
size: 1024
buf: 0xf610a358
flags: 524569
121.557 12.5800 0 0 0xfbf6aec0 2 biodone device: 8388632
block: 795086
buf: 0xf610a358
121.755 0.17900 632 1 0xf61a1d80 0 pageout vnode: 0xf61d0a48
pages_pageout: 0
pages_freed: 0
pages_reclaimed 0


The first strategy routine is an 8-Kbyte write (flags bit 0x40 is set for reads) to block 795344. To decode the device, 8388632 = 0x800018, 0x18 = 24, and running ls -lL on /dev/dsk shows that c0t3d0s0 is minor device 24, which is mounted as /export/home. It’s an old Sun 424-Mbyte drive, so valid blocks are from 0 to 828719. This is a seek to right at the end of the disk, and it takes 108 ms. It is followed by a 1-Kbyte write to a nearby block that takes only 12.58 ms. The pageout scanner runs but finds nothing to do, so it reports no pages paged out, freed, or reclaimed this time. After 8 seconds everything goes quiet, then there is a very interesting burst of activity around the 22-second mark that is all initiated by PID 3, the fsflush process.
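A small sketch of that device-number decoding, assuming the SVR4 32-bit dev_t layout with an 18-bit minor number (the text itself only relies on the low bits being the minor); the helper name is mine:

def decode_dev(dev):
    """Split a Solaris dev value into (major, minor), taking the low 18 bits as the minor."""
    return dev >> 18, dev & 0x3FFFF

major, minor = decode_dev(8388632)           # 8388632 == 0x800018
print(f"major {major}, minor {minor}")       # minor 24, which ls -lL /dev/dsk maps to c0t3d0s0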

Elapsed       Delta       PID    TID             CPU Probe    Data / 
(ms) (ms) Name Description
8034.21800 0.019000 0 0 0xfbf6aec0 2 biodone device: 8388632
block: 796976
buf: 0xf610a538
21862.0155 13827.72 3 1 0xf5d8cc80 2 strategy device: 8388632
block: 240
size: 2048
buf: 0xf5e71158
flags: 9
21897.7560 35.74050 0 0 0xfbf6aec0 2 biodone device: 8388632
block: 240
buf: 0xf5e71158
21897.9440 0.188000 3 1 0xf5d8cc80 0 strategy device: 8388632
block: 16
size: 2048
buf: 0xf5adadc0
flags: 524809
21907.1305 9.186500 0 0 0xfbf6aec0 2 biodone device: 8388632
block: 16
buf: 0xf5adadc0


We start by seeking from block 796976 to 240 with a 36-ms write, followed by block 16 with a 9-ms write. This write is probably the filesystem superblock being updated. The following set of writes is issued extremely close together, 14 of them in about 1.5 ms, and they refer to two different devices. The set starts with device 8388624, which maps to c0t2d0s0, a Sun 1.05-GB disk holding the root file system, then also accesses /export/home briefly during the sequence. For brevity, I have trimmed this output to just the root disk and removed data that is the same as the first trace record.

Elapsed (ms)  Delta (ms)  PID  LWPID  TID          CPU  Probe Name  Data / Description
21934.5060 27.3755 3 1 0xf5d8cc80 0 strategy device: 8388624
block: 32544
size: 8192
buf: 0xf5afd780
flags: 1033
21934.7320 0.2260 3 1 0xf5d8cc80 0 strategy block: 64896
buf: 0xf5e71a40
21934.9490 0.2170 3 1 0xf5d8cc80 0 strategy block: 887696
buf: 0xf5e71ef0
21935.0420 0.0930 3 1 0xf5d8cc80 0 strategy block: 855296
buf: 0xf5e70e10
21935.1290 0.0870 3 1 0xf5d8cc80 0 strategy block: 855248
buf: 0xf5e70438
21935.3265 0.1280 3 1 0xf5d8cc80 0 strategy block: 678272
buf: 0xf5afd000
21935.4935 0.0825 3 1 0xf5d8cc80 0 strategy block: 887664
buf: 0xf5aff760
21935.5805 0.0870 3 1 0xf5d8cc80 0 strategy block: 194560
buf: 0xf5afdc30
21935.7530 0.0770 3 1 0xf5d8cc80 0 strategy block: 32480
buf: 0xf5afe4a0
21935.8330 0.0800 3 1 0xf5d8cc80 0 strategy block: 887680
buf: 0xf5afe6f8
21935.9115 0.0785 3 1 0xf5d8cc80 0 strategy block: 1629440
buf: 0xf5afd078


The requested seek order on root is blocks 32544, 64896, 887696, 855296, 855248, 678272, 887664, 194560, 32480, 887680, 1629440. That’s pretty random! None of the requests complete until after the last one is issued. Here are the completions.

Elapsed (ms)  Delta (ms)  PID  LWPID  TID          CPU  Probe Name  Data / Description
21964.1510 28.2395 0 0 0xfbe1cec0 0 biodone device: 8388624
block: 32544
buf: 0xf5afd780
21980.0535 8.2955 0 0 0xfbf6aec0 2 biodone block: 64896
buf: 0xf5e71a40
21994.0765 6.0985 0 0 0xfbe1cec0 0 biodone block: 887696
buf: 0xf5e71ef0
22009.2190 13.9775 0 0 0xfbf6aec0 2 biodone block: 855296
buf: 0xf5e70e10
22023.8295 14.6105 0 0 0xfbe1cec0 0 biodone block: 855248
buf: 0xf5e70438
22037.5215 13.6920 0 0 0xfbf6aec0 2 biodone block: 678272
buf: 0xf5afd000
22055.0835 17.5620 0 0 0xfbe1cec0 0 biodone block: 887664
buf: 0xf5aff760
22077.5950 22.5115 0 0 0xfbf6aec0 2 biodone block: 194560
buf: 0xf5afdc30
22099.4810 21.8860 0 0 0xfbe1cec0 0 biodone block: 32480
buf: 0xf5afe4a0
22125.7145 26.2335 0 0 0xfbf6aec0 2 biodone block: 887680
buf: 0xf5afe6f8
22146.7055 20.9910 0 0 0xfbe1cec0 0 biodone block: 1629440
buf: 0xf5afd078


The order of completion for this disk is the same as the order of issue because the firmware in this old Sun 1.05-Gbyte disk is not intelligent enough to reorder seeks. More recent disks would reorder the operations, and responses would be seen out of order. There are about 1,000 blocks per cylinder on this disk, so we can approximate the number of cylinders per seek. Remember that disks are variable geometry nowadays, so it’s probably more than 1,000 blocks per cylinder near block 0 and progressively less at higher block numbers. For the root disk only, the delay and seek distances are shown in Table 8-3.

Table 8-3. Trace-Derived Sequence of Disk Service Times
Block     Seek Cylinders   Service Time   Response Time
32544     1333             29.65 ms       29.65 ms
64896     32               15.67 ms       45.32 ms
887696    822              13.81 ms       59.13 ms
855296    back 32          15.05 ms       74.18 ms
855248    0                14.52 ms       88.70 ms
678272    back 177         13.50 ms       102.2 ms
887664    209              17.39 ms       119.59 ms
194560    back 693         22.42 ms       142.01 ms
32480     back 162         21.72 ms       163.73 ms
887680    855              26.15 ms       189.88 ms
1629440   742              20.91 ms       210.79 ms
Average                    19.16 ms       111.38 ms

What we see is that 11 writes of 8 Kbytes were fired off in one go, randomly spread over the disk, and that they completed sequentially, taking on average 19 ms each (not that bad, and accesses to the other disk may have slowed it down a bit). The later accesses were delayed by the earlier ones sufficiently that this burst (which took only 210 ms to complete) would report an average response time of 111 ms in iostat. Measured over a 10-second interval, and if there was no other disk activity, we would see something like this:

% iostat -x 10 
...
extended disk statistics
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
sd2 0.0 1.1 0.0 8.8 0.0 0.0 111.4 0 2

Now, that output looks very familiar. It seems that we have found the culprit! fsflush is causing large service-time reports by kicking off a burst of activity. In most cases, applications generate access patterns that are either sequential bursts (large transfers) or separate random accesses in a sequence over time. In this case, a long, completely random sequence is being generated in such a short time that a queue forms. It is all over very quickly and occurs only every 30 seconds or so, but random sequence generation is quite likely to happen on any disk that contains a file system. The trace data shown was taken at the first attempt, so it isn’t hard to find this situation occurring.
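To see how the iostat numbers above follow from the trace, here is the arithmetic for an otherwise idle 10-second interval, using the averages from Table 8-3:

w/s = 11 writes / 10 s = 1.1

Kw/s = 11 * 8 Kbytes / 10 s = 8.8

%b = 0.21 s busy / 10 s = about 2%

svc_t = sum of response times / completions = (11 * 111.38 ms) / 11 = 111.4 ms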

I hope you are also inspired to start using TNF. It can illuminate all kinds of situations for you and, at user level, is a great way of embedding timing points into applications. A dormant probe point is extremely lightweight, so there is little penalty, and probes can be left in production code.

How iostat Uses the Underlying Disk Measurements

To really understand the data being presented by iostat, sar, and other tools, you need to look at the raw data being collected by the kernel, remember some history, and do some simple mathematics.

In the old days, disk controllers really did control the disks directly. If you still remember the old Sun machines with SMD disks and Xylogics controllers, you may know what I mean. All the intelligence was in the controller, which was a large VMEbus card inside the system cabinet. The disk heads were directly connected to the controller, and the device driver knew exactly which track the disk was reading. As each bit was read from disk, it was buffered in the controller until a whole disk block was ready to be passed to the device driver.

The device driver maintained a queue of waiting requests, which were serviced one at a time by the disk. From this, the system could report the service time directly as milliseconds-per-seek. The throughput in transfers per second was also reported, as was the percentage of the time that the disk was busy, the utilization. The terms utilization, service time, wait time, throughput, and wait queue length have well-defined meanings in this scenario. A set of simple equations from queuing theory can be used to derive these values from underlying measurements. The original version of iostat in SunOS 3.X and SunOS 4.X was basically the same as in BSD Unix.

Over time, disk technology moved on. Nowadays, a standard disk is SCSI based and has an embedded controller. The disk drive contains a small microprocessor and about 1 Mbyte of RAM. It can typically handle up to 64 outstanding requests via SCSI tagged-command queuing. The system uses a SCSI Host Bus Adaptor (HBA) to talk to the disk. In large systems, there is another level of intelligence and buffering in a hardware RAID controller. The simple model of a disk used by iostat and its terminology have become confused.

In the old days, once the device driver sent the disk a request, it knew that the disk would do nothing else until the request was complete. The time it took was the service time, and the average service time was a property of the disk itself. Disks that spin faster and seek faster have lower (better) service times. With today’s systems, the device driver issues a request, that request is queued internally by the RAID controller and the disk drive, and several more requests can be sent before the first one comes back. The service time, as measured by the device driver, varies according to the load level and queue length and is not directly comparable to the old-style service time of a simple disk drive.

The instrumentation provided in Solaris 2 takes account of this change by explicitly measuring a two-stage queue: one queue, called the wait queue, in the device driver; and one queue, called the active queue, in the device itself. A read or write command is issued to the device driver and sits in the wait queue until the SCSI bus and disk are both ready. When the command is sent to the disk device, it moves to the active queue until the disk sends its response. The problem with iostat is that it tries to report the new measurements in some of the original terminology. The “wait service time” is actually the time spent in the “wait” queue. This is not the right definition of service time in any case, and the word “wait” is being used to mean two different things. To sort out what we really do have, we need to move on to do some mathematics.

Let’s start with the actual measurements made by the kernel. For each disk drive (and each disk partition, tape drive, and NFS mount in Solaris 2.6), a small set of counters is updated. An annotated copy of the kstat(3K)-based data structure that the SE toolkit uses is shown in Figure 8-3.

Figure 8-3. Kernel Disk Information Statistics Data Structure
struct ks_disks {
long number$; /* linear disk number */
string name$; /* name of the device */

ulonglong nread; /* number of bytes read */
ulonglong nwritten; /* number of bytes written */
ulong reads; /* number of reads */
ulong writes; /* number of writes */
longlong wtime; /* wait queue - time spent waiting */
longlong wlentime; /* wait queue - sum of queue length multiplied
by time at that length */
longlong wlastupdate;/* wait queue - time of last update */
longlong rtime; /* active/run queue - time spent active/running */
longlong rlentime; /* active/run queue - sum of queue length * time
at that length */
longlong rlastupdate;/* active/run queue - time of last update */
ulong wcnt; /* wait queue - current queue length */
ulong rcnt; /* active/run queue - current queue length */
};


None of these values are printed out directly by iostat, so this is where the basic arithmetic starts. The first thing to realize is that the underlying metrics are cumulative counters or instantaneous values. The values printed by iostat are averages over a time interval. We need to take two copies of the above data structure, together with a high-resolution timestamp for each, and do some subtraction. We then get the average values between the start and end times. I’ll write it out as plainly as possible, with pseudocode that assumes an array of two values for each measure, indexed by start and end. Thires is in units of nanoseconds, so we divide to get seconds as T.

Thires = hires elapsed time = EndTime – StartTime = timestamp[end] – timestamp[start]

Bwait = hires busy time for wait queue = wtime[end] – wtime[start]

Brun = hires busy time for run queue = rtime[end] – rtime[start]

QBwait = wait queue length * time = wlentime[end] – wlentime[start]

QBrun = run queue length * time = rlentime[end] – rlentime[start]

Now, we assume that all disk commands complete fairly quickly, so the arrival and completion rates are the same in a steady state average, and the throughput of both queues is the same. I’ll use completions below because it seems more intuitive in this case.
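In the same pseudocode style, the completion count and throughput are:

C = completed commands = reads[end] + writes[end] – reads[start] – writes[start]

X = Xwait = Xrun = throughput in commands per second = C / T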

A similar calculation gets us the data rate in kilobytes per second.
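Summing the byte counters gives:

data rate in Kbytes/s = (nread[end] – nread[start] + nwritten[end] – nwritten[start]) / 1024 / T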

Next, we can obtain the utilization—the busy time as a percentage of the total time.
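Using the busy time measured for each queue:

Uwait = wait queue utilization = 100 * Bwait / Thires

Urun = active queue utilization = 100 * Brun / Thires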

Now, we get to something called service time, but it is not what iostat prints out and calls service time. This is the real thing!
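Dividing the busy time for each queue by the number of commands processed gives the average time each command took at the head of that queue:

Swait = wait queue service time = Bwait / C

Srun = active queue service time = Brun / C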

The meaning of Srun is as close as you can get to the old-style disk service time. Remember that the disk can run more than one command at a time and can return them in a different order than they were issued, and it becomes clear that it cannot be the same thing.

The data structure contains an instantaneous measure of queue length, but we want the average over the time interval. We get this from that strange “length time” product by dividing it by the elapsed time. (Dividing by the busy time instead would give the average queue length only while the device was busy.)
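That is:

Qwait = average wait queue length = QBwait / Thires

Qrun = average active queue length = QBrun / Thires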

Finally, we can get the number that iostat calls service time. It is defined as the queue length divided by the throughput, but it is actually the residence or response time and includes all queuing effects. The real definition of service time is the time taken for the first command in line to be processed, and its value is not printed out by iostat. Thanks to the SE toolkit, this deficiency is easily fixed. A “corrected” version of iostat written in SE prints out the data, using the format shown in Figure 8-4. This new format is used by several scripts in SE release 3.0.
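Using the queue lengths and throughput from above:

Rwait = residence time in the wait queue = Qwait / X = QBwait / C

Rrun = residence time in the active queue = Qrun / X = QBrun / C

The res_t columns in Figure 8-4 report these residence times for each queue separately; the single svc_t that iostat -x prints corresponds to the residence time summed over both queues.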

Figure 8-4. SE-Based Rewrite of iostat to Show Service Time Correctly
% se siostat.se 10 
03:42:50 ------throughput------ -----wait queue----- ----active queue----
disk r/s w/s Kr/s Kw/s qlen res_t svc_t %ut qlen res_t svc_t %ut
c0t2d0s0 0.0 0.2 0.0 1.2 0.00 0.02 0.02 0 0.00 22.87 22.87 0
03:43:00 ------throughput------ -----wait queue----- ----active queue----
disk r/s w/s Kr/s Kw/s qlen res_t svc_t %ut qlen res_t svc_t %ut
c0t2d0s0 0.0 3.2 0.0 23.1 0.00 0.01 0.01 0 0.72 225.45 16.20 5

The Solaris 2 disk instrumentation is complete and accurate. Now that it has been extended to tapes, partitions, and client NFS mount points, there is a lot more that can be done with it. It’s a pity that the naming conventions used by iostat are so confusing and that sar -d mangles the data so much for display. We asked if sar could be fixed, but its output format and options are largely constrained by cross-platform Unix standards. We tried to get iostat fixed, but it was felt that the current naming convention was what users expected to see, so changing the header or data too much would confuse existing users. Hopefully, this translation of existing practice into the correct terminology will help reduce the confusion somewhat.
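To make the arithmetic concrete, here is a minimal sketch in C that reads the raw counters through the libkstat interface and prints the derived values for one disk over a 10-second interval. It is illustrative only: the module name “sd” and instance 0 are assumptions for whichever disk you want to watch, error handling is minimal, and it must be linked with -lkstat.

/* diskq.c: derive iostat-style values from the raw kstat I/O counters.
   Sketch only; assumes a disk kstat for module "sd", instance 0 exists. */
#include <stdio.h>
#include <unistd.h>
#include <kstat.h>
#include <sys/time.h>   /* gethrtime() */

int main(void)
{
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_io_t io[2];
        hrtime_t ts[2];
        double T, C, Bwait, QBwait, Brun, QBrun;

        if (kc == NULL || (ksp = kstat_lookup(kc, "sd", 0, NULL)) == NULL) {
                fprintf(stderr, "cannot find the disk kstat\n");
                return 1;
        }

        kstat_read(kc, ksp, &io[0]);            /* first snapshot */
        ts[0] = gethrtime();
        sleep(10);                              /* measurement interval */
        kstat_read(kc, ksp, &io[1]);            /* second snapshot */
        ts[1] = gethrtime();

        T      = (ts[1] - ts[0]) / 1e9;                     /* elapsed seconds */
        C      = (double)(io[1].reads + io[1].writes) -
                 (double)(io[0].reads + io[0].writes);      /* completions */
        Bwait  = (io[1].wtime    - io[0].wtime)    / 1e9;   /* wait queue busy time */
        QBwait = (io[1].wlentime - io[0].wlentime) / 1e9;   /* wait queue length-time product */
        Brun   = (io[1].rtime    - io[0].rtime)    / 1e9;   /* active queue busy time */
        QBrun  = (io[1].rlentime - io[0].rlentime) / 1e9;   /* active queue length-time product */

        printf("throughput %.1f ops/s  active queue utilization %.1f%%\n",
            C / T, 100.0 * Brun / T);
        if (C > 0) {
                printf("wait queue:   qlen %.2f  svc_t %.2f ms  res_t %.2f ms\n",
                    QBwait / T, 1000.0 * Bwait / C, 1000.0 * QBwait / C);
                printf("active queue: qlen %.2f  svc_t %.2f ms  res_t %.2f ms\n",
                    QBrun / T, 1000.0 * Brun / C, 1000.0 * QBrun / C);
        }
        return 0;
}

Its qlen, svc_t, and res_t values should track what siostat.se prints in Figure 8-4.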

Filesystem Tuning

The UFS filesystem code includes an I/O clustering algorithm, which groups successive reads or writes into a single, large command to transfer up to 56 Kbytes rather than lots of 8-Kbyte transfers. This grouping allows the filesystem layout to be tuned to avoid sector interleaving and allows filesystem I/O on sequential files to get close to its theoretical maximum.

The UFS filesystem layout parameters can be modified by means of tunefs[1]. By default, these parameters are set to provide maximum overall throughput for all combinations of read, write, and update operations in both random and sequential access patterns. For a single disk, the gains that can be made by optimizing disk layout parameters with tunefs for one kind of operation over another are small.

[1] See the manual page and The Design and Implementation of the 4.3BSD UNIX Operating System by Leffler, McKusick, Karels, and Quarterman for details of the filesystem implementation.

The clustering algorithm is controlled by the tunefs parameters rotdelay and maxcontig. The default rotdelay parameter is zero, meaning that files are stored in contiguous sectors; any other value disables clustering. The default maxcontig parameter is 7, so 7 * 8 Kbytes = 56 Kbytes per cluster; this is the maximum transfer size supported on all machines, and it gives a good compromise for performance with various access patterns.

Higher values can be configured and can be useful when you are working with large sequential accesses.
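For example, on a file system that mostly holds large sequential files, you might raise maxcontig with something like the following (the value of 16 blocks, which corresponds to 128-Kbyte clusters, and the device name are purely illustrative):

# tunefs -a 16 /dev/rdsk/c0t3d0s6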

You can free extra capacity in file systems that contain larger than average files by creating the file system with a higher number of bytes per inode. File system creation time is also greatly reduced. The default is 2048 bytes per inode. I often use newfs -i 8192.

Eagle DiskPak

A product from Eagle Software, Inc., called DiskPak™, has some novel features that can improve throughput for heavily used file systems. The product reorganizes the layout of data blocks for each file on the disk to make all files sequential and contiguous and to optimize the placement of the UFS partial block fragments that occur at the end of files. It also has a filesystem browser utility that gives a visual representation of the block layout and free space distribution of a file system. The most useful capability of this product is that it can sort files on the basis of several criteria, to minimize disk seek time. The main criteria are access time, modification time, and size. If a subset of the files is accessed most often, then it helps to group them together on the disk. Sorting by size helps separate a few large files from more commonly accessed small files. According to the vendor, speedups of 20 percent have been measured for a mixed workload. DiskPak is available for both SunOS 4.X and Solaris 2.X. You can download trial software from http://www.eaglesoft.com.

Disk Specifications

Disk specifications are sometimes reported according to a “best case” approach, which is disk-format independent. Some parameters are quoted in the same way by both disk manufacturers and computer system vendors, an approach that can confuse you because they are not measuring the same thing. Sun uses disks from many different suppliers, including Seagate, IBM, Fujitsu, and Quantum. For example, the Seagate web site, http://www.seagate.com, contains complete specifications for all their recent disks. The system may call the disk a SUN2.1G, but if you use the format command to make an enquiry, as shown in Figure 8-5, or use the new iostat -E option in Solaris 2.6, you can find out the vendor and model of the disk.

Figure 8-5. How to Identify the Exact Disk Type and Vendor by Using format
# format 
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c0t0d0
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
Specify disk (enter its number): 0
selecting c0t0d0
[disk formatted]
Warning: Current Disk has mounted partitions.

FORMAT MENU:
disk - select a disk
... text omitted ....
inquiry - show vendor, product and revision
quit
format> inquiry
Vendor: SEAGATE
Product: ST32171W SUN2.1G
Revision: 7462
format> quit
#

In this case, on an Ultra 1/170E, the vendor and disk are Seagate ST32171W. Visiting the Seagate web site, I found a data sheet that quoted this file: ftp://ftp.seagate.com/techsuppt/scsi/st32171w.txt. Excerpts are shown in Figure 8-6.

Figure 8-6. Example Vendor Disk Specification Sheet
ST-32171W/WC Ultra-SCSI Wide (Barracuda 4LP) 
FORMATTED CAPACITY (GB) __________________2.16
AVERAGE SECTORS PER TRACK ________________163 rounded down
ACTUATOR TYPE ____________________________ROTARY VOICE COIL
TRACKS ___________________________________25,890
CYLINDERS ________________________________5,178 user
HEADS ______PHYSICAL______________________5
DISCS (3.5 in) ___________________________3
MEDIA TYPE _______________________________THIN FILM/MR
RECORDING METHOD _________________________ZBR PRML (0,4,4)
INTERNAL TRANSFER RATE (mbits/sec)________80 to 122
EXTERNAL TRANSFER RATE (mbyte/sec) _______40 Sync
SPINDLE SPEED (RPM) ______________________7,200
AVERAGE LATENCY (mSEC) ___________________4.17
BUFFER ___________________________________512 Kbyte
Read Look-Ahead, Adaptive,
Multi-Segmented Cache
INTERFACE ________________________________Ultra SCSI
ASA II, SCAM level 2
BYTES PER TRACK __________________________102,500 average
SECTORS PER DRIVE ________________________
TPI (TRACKS PER INCH) ____________________5,555
BPI (PEAK KBITS PER INCH) ________________123
AVERAGE ACCESS (ms read/write)____________9.4/10.4
Drive level with controller overhead
SINGLE TRACK SEEK (ms read/write) ________2.0/2.1
MAX FULL SEEK (ms read/write) ____________17.5/18.5
MTBF (power-on hours) ____________________1,000,000


This excerpt includes the basic information that we need.

What the Disk Makers Specify

The disk manufacturers specify certain parameters for a drive.

  • Rotational speed in revolutions per minute (rpm)

  • The number of tracks or cylinders on the disk

  • The number of heads or surfaces in each cylinder

  • The rate at which data is read and written (millions of bytes/s)

  • The disk controller interface used (IDE, SCSI, Wide SCSI, Ultra SCSI, FC-AL)

  • The capacity of the drive (millions of bytes)

  • The average access time and single track and maximum seek time of the disk

When disk makers build the read/write heads, the speed of data transfer is measured in megahertz (MHz), which is converted into megabytes by division by eight. All the tracks that can be accessed without moving the heads form a cylinder. The single cylinder seek time is the time taken to move the heads to the next cylinder. On many high-performance drives, the head assembly is accelerated for the first half of the seek and decelerated for the second half. In this way, a seek of several cylinders in one attempt takes less time than many short steps with stops along the way. The average cylinder-to-cylinder seek time is usually calculated by accounting for all possible seek distances. To get to a particular sector on the disk, the disk must rotate until it comes past the heads, so another component of overall access time is the rotational latency of the disk. The average rotational latency is usually quoted as the time taken for half a rotation.

The average access time quoted by the manufacturer is the average seek and rotate time; for the disk quoted above, it is about 10 ms. The average rotational latency for 7200 rpm drives is 4.1 ms, and the seek time varies from 2 ms to 18 ms, as shown in Figure 8-7.
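The rotational latency figure follows directly from the spindle speed: a full rotation at 7200 rpm takes 60,000 / 7200 = 8.33 ms, and on average the required sector is half a rotation away, so

average rotational latency = 60,000 / 7200 / 2 = 4.17 ms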

Figure 8-7. Average Disk Seek and Rotation Components


What the System Vendors Specify

The system vendors need to deal with the disk in terms of sectors, each typically containing 512 bytes of data and many bytes of header, preamble, and intersector gap. Spare sectors and spare cylinders are also allocated, so that bad sectors can be substituted. This allocation reduces the unformatted capacity to what is known as the formatted capacity. For example, the disk shown in Figure 8-6 has an unformatted capacity of 102,500 bytes per track and 25,890 tracks, making a total of 2.65 Gbytes. It has a formatted capacity of 163 sectors per track using 512-byte sectors, making a capacity of 2.16 Gbytes. In recent years, vendors seem to have become more consistent about quoting the formatted capacity for drives. The formatted capacity of the drive is measured in units of Mbytes = 10^6 = 1,000,000, whereas RAM sizes are measured in units of Mbytes = 2^20 = 1,048,576. Confused? You will be! It is very easy to mix these up and make calculations that have a built-in error of 4.8 percent in the time taken to write a block of memory to disk.
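Spelling out the arithmetic for this drive:

unformatted capacity = 102,500 bytes/track * 25,890 tracks = 2,653,725,000 bytes = about 2.65 Gbytes

formatted capacity = 163 sectors * 512 bytes * 25,890 tracks = 2,160,675,840 bytes = about 2.16 Gbytes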

Many of Sun’s disks are multiply sourced to a common specification. Each vendor’s disk has to meet or exceed that specification, so Sun’s own published performance data may be a little bit more conservative. You should also be cautious about the data. I have come across inconsistencies in the speeds, seek times, and other details. I’ve also found disks advertising via SCSI inquiry that they are a completely different specification from that on the data sheet for that model.

What You Can Work Out for Yourself

For some old disks, you can work out, using information from /etc/format.dat, the real peak throughput and size in Kbytes (1024) of your disk. The entry for a typical disk is shown in Figure 8-8. In Solaris 2.3 and later releases, the format information is read directly from SCSI disk drives by the format command. The /etc/format.dat file entry is no longer required for SCSI disks.

Figure 8-8./etc/format.dat Entry for Sun 669-Mbyte Disk
disk_type= "SUN0669" \ 
: ctlr= MD21: fmt_time= 4 \
: trks_zone= 15: asect= 5: atrks= 30 \
: ncyl= 1614: acyl= 2: pcyl= 1632: nhead= 15: nsect= 54 \
: rpm= 3600 : bpt= 31410

The values to note are:

  • rpm = 3600, so the disk spins at 3600 rpm

  • nsect = 54, so there are 54 sectors of 512 bytes per track

  • nhead = 15, so there are 15 tracks per cylinder

  • ncyl = 1614, so there are 1614 cylinders per disk

Since we know that there are 512 bytes per sector, 54 sectors per track, and that a track will pass by the head 3,600 times per minute, we can work out the peak sustained data rate and size of the disk.

data rate (bytes/sec) = (nsect * 512 * rpm) / 60 = 1,658,880 bytes/s

size (bytes) = nsect * 512 * nhead * ncyl = 669,358,080 bytes

If we define that 1 Kbyte is 1024 bytes, then the data rate is 1620 Kbytes/s.

Standardizing on 1 Kbyte = 1024 is convenient since this is what the disk-monitoring utilities assume; since sectors are 512 bytes, pages are 4,096 bytes, and the UFS file system uses 8,192-byte blocks, the standard 1 Kbyte = 1024 is more useful than 1K = 1000.

The Seagate data sheet does not give us the formatted data rate, but it does indicate that there are 168 sectors per track on average. At 7200 rpm, this works out as:

data rate = 168 * 512 * 7200 / 60 = 10,080 Kbytes/s

This result shows the progress in disk performance that has been made over seven years or so. Random access rates have doubled as the disks spin twice as fast, but sequential data rates are up by a factor of six.

Some Sun disks are listed in Table 8-4, with Kbyte = 2^10 = 1024. The data rate for ZBR drives cannot be calculated simply, since the format.dat nsect entry is not the real value. Instead, I have provided an average value for sequential reads based on the quoted average number of sectors per track for an equivalent Seagate model.

Table 8-4. Common Disk Specifications
Disk Type           Bus             MB/s     RPM   Access   Capacity     Avg Data Rate
Sun 207 3.5”        5 MB/s SCSI     1.6      3600  16 ms    203148 KB    1080 KB/s
Sun 424 3.5”        5 MB/s SCSI     2.5-3.0  4400  14 ms    414360 KB    2584 KB/s ZBR
Sun 535 3.5x1”      10 MB/s SCSI    2.9-5.1  5400  11 ms    522480 KB    3608 KB/s ZBR
Sun 669 5.25”       5 MB/s SCSI     1.8      3600  16 ms    653670 KB    1620 KB/s
Sun 1.05G 3.5x1.6”  10 MB/s SCSI    2.9-5.0  5400  11 ms    1026144 KB   3840 KB/s ZBR
Sun 1.05G 3.5x1”    20 MB/s SCSI    2.9-5.0  5400  11 ms    1026144 KB   3968 KB/s ZBR
Sun 1.3G 5.25”      5 MB/s SCSI     3.0-4.5  5400  11 ms    1336200 KB   3288 KB/s ZBR
Sun 1.3G 5.25”      6 MB/s IPI      3.0-4.5  5400  11 ms    1255059 KB   2610-3510 KB/s
Sun 2.1G 5.25”      10 MB/s DSCSI   3.8-5.0  5400  11 ms    2077080 KB   3952 KB/s ZBR
Sun 2.1G 3.5x1.6”   10 MB/s SCSI    3.7-5.2  5400  11 ms    2077080 KB   3735 KB/s ZBR
Sun 2.1G 3.5x1”     20 MB/s SCSI    6.2-9.0  7200  8.5 ms   2077080 KB   6480 KB/s ZBR
Sun 2.1G 3.5x1”     40 MB/s SCSI    10-15    7200  8.5 ms   2077080 KB   10080 KB/s ZBR
Sun 2.9G 5.25”      20 MB/s DSCSI   4.4-6.5  5400  11 ms    2841993 KB   4455 KB/s ZBR
Sun 4.2G 3.5x1.6”   20 MB/s SCSI    4.2-7.8  5400  10 ms    4248046 KB   4950 KB/s ZBR
Sun 4.2G 3.5x1”     40 MB/s SCSI    10-15.2  7200  10 ms    4248046 KB   9840 KB/s ZBR
Sun 9.1G 5”         20 MB/s DSCSI   5.5-8.1  5400  11.5 ms  8876953 KB   5985 KB/s ZBR
Sun 9.1G 3.5x1.6”   40 MB/s SCSI    10-15.2  7200  8.5 ms   8876953 KB   10080 KB/s ZBR
Sun 9.1G 3.5x1”     100 MB/s FC     10-15.2  7200  8.5 ms   8876953 KB   10080 KB/s ZBR

Zoned Bit Rate (ZBR) Disk Drives

ZBR drives vary, depending on which cylinder is accessed. The disk is divided into zones with different bit rates; the outer part of the drive is faster and has more sectors per track than does the inner part of the drive. This scheme allows the data to be recorded with a constant linear density along the track (bits per inch). In other drives, the peak number of bits per inch that can be made to work reliably is set up for the innermost track, but density is too low on the outermost track, so capacity is wasted. In a ZBR drive, more data is stored on the outer tracks, so greater capacity and higher data rates are possible. The drive zones provide peak performance from the first third of the disk. The next third falls off slightly, but the last third of the disk may be as much as 25 percent slower. Table 8-5 summarizes performance for a 1.3-Gbyte ZBR disk drive.

Note

When partitioning a ZBR disk, remember that partition “a” or slice 0 will be faster than partition “h” or slice 7.


Table 8-5. 1.3-Gbyte IPI ZBR Disk Zone Map
Zone   Start Cylinder   Sectors per Track   Data Rate in Kbytes/s
0      0                78                  3510
1      626              78                  3510
2      701              76                  3420
3      801              74                  3330
4      926              72                  3240
5      1051             72                  3240
6      1176             70                  3150
7      1301             68                  3060
8      1401             66                  2970
9      1501             64                  2880
10     1601             62                  2790
11     1801             60                  2700
12     1901             58                  2610
13     2001             58                  2610
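The data rate column follows from the zone geometry in the same way as the format.dat calculation above; for zone 0 of this 5400 rpm drive:

data rate = 78 sectors * 512 bytes * 5400 rpm / 60 / 1024 = 3510 Kbytes/s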

The format.dat entry assumes constant geometry, so it has a fixed idea about sectors per track, and the number of cylinders in format.dat is reduced to compensate. The number of sectors per track is set to make sure partitions start on multiples of 16 blocks and does not accurately reflect the geometry of the outer zone. The 1.3-Gbyte IPI drive outer zone happens to match format.dat, but the other ZBR drives have more sectors than format.dat states.

IPI Disk Controllers

As long as IPI disks enjoyed a performance advantage over SCSI disks, the extra cost of low-volume, IPI-specific disks could be justified. Now that SCSI disks are higher capacity, faster, cheaper, and more intelligent, IPI subsystems have become obsolete. Each SCSI disk nowadays contains about as much intelligence and RAM buffer as the entire IPI controller from a few years ago.

An SBus IPI controller is available from Genroco, Inc., so that old disk subsystems can be migrated to current generation servers.

SCSI Disk Controllers

SCSI controllers can be divided into five generic classes. Very old ones support only Asynchronous SCSI, old controllers support Synchronous SCSI, and most recent ones support Fast Synchronous SCSI; the latest two types to appear are Fast and Wide Synchronous SCSI, and Ultra SCSI. In addition, there are now two kinds of fiber channel that carry the SCSI protocol. Working out which type you have on your Sun is not that simple. The main differences between interface types are in the number of disks that can be supported on a SCSI bus before the bus becomes saturated and the maximum effective cable length allowed.

SCSI Interface Types

The original asynchronous and synchronous SCSI interfaces support data rates up to 5 Mbytes/s on a 6-meter effective cable length. The speed of asynchronous SCSI drops off sharply as the cable length increases, so keep the bus as short as possible.

Fast SCSI increases the maximum data rate from 5 Mbytes/s to 10 Mbytes/s and halves the cable length from 6 meters to 3 meters. By use of the latest type of very high quality SCSI cables and active termination plugs from Sun, fast SCSI can be made to work reliably at up to 6 meters.

Fast and Wide SCSI uses a special cable with extra signals (68 rather than 50 pins) to carry 16 bits of data rather than 8 bits in each clock cycle. This usage doubles the data rate to 20 Mbytes/s and provides 8 more drive select lines, and so can support 15 target devices.

Ultra SCSI doubles the clock rate again to 20 MHz on the same cabling and connectors as fast-wide SCSI, hence providing a 40-Mbyte/s bandwidth. Cable lengths are halved as a consequence, severely limiting the number of devices that can be connected. Ultra SCSI is most commonly used to connect to a hardware RAID controller if a large number of devices need to be configured. There is also an Ultra2 SCSI that runs at 40 MHz, providing 80 Mbytes/s bandwidth, but there are serious cabling problems at this data rate and Sun has moved on to fiber interfaces instead.
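The bandwidth figures are simply the clock rate multiplied by the bus width:

fast SCSI = 10 MHz * 1 byte = 10 Mbytes/s

fast and wide SCSI = 10 MHz * 2 bytes = 20 Mbytes/s

Ultra (wide) SCSI = 20 MHz * 2 bytes = 40 Mbytes/s

Ultra2 (wide) SCSI = 40 MHz * 2 bytes = 80 Mbytes/s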

Differential signaling can be used with Fast SCSI, Fast/Wide SCSI, and Ultra SCSI to increase the cable length to 25 meters (12 meters for Ultra SCSI), but it uses incompatible electrical signals and a different connector and so can only be used with devices that have purpose-built differential interfaces. The principle is that instead of a single electrical signal varying between 0V and +5V, the transmitter generates two signals varying between +5V and -5V, where one is the inverse of the other. Any noise pickup along the cable tends to be added to both signals; however, in the receiver, the difference between the two signals is used, so the noise cancels out and the signal comes through clearly.

Sun’s original Fiber Channel SCSI uses a high-speed, optical fiber interconnect to carry the SCSI protocol to a disk array. It runs at about 25 Mbytes/s in each direction simultaneously for 1,000 meters or more. This is actually quarter-speed fiber.

The latest type of fiber channel is full speed at 100 Mbytes/s in both directions. The very latest standard forms a loop of fiber that connects to individual disk drives that have been fitted with their own fiber interfaces. This is called Fiber Channel Arbitrated Loop, or FC-AL (often pronounced eff-cal); a double loop is used for redundancy and even higher throughput. Sun’s A5000 FC-AL array can sustain 180 Mbytes/s over its dual FC-AL loops that go directly to every disk in the package.

The SCSI disks shown in Table 8-5 on page 205 transfer data at a much slower rate than that of the FC-AL array. They do, however, have a buffer built into the SCSI drive that collects data at the slower rate and can then pass data over the bus at a higher rate. With Asynchronous SCSI, the data is transferred by means of a handshake protocol that slows down as the SCSI bus gets longer. For fast devices on a very short bus, the SCSI bus can achieve full speed, but as devices are added and the bus length and capacitance increase, the transfers slow down. For Synchronous SCSI, the devices on the bus negotiate a transfer rate that will slow down if the bus is long, but by avoidance of the need to send handshakes, more data can be sent in its place and throughput is less dependent on the bus length.

Tagged Command Queuing Optimizations

TCQ provides optimizations similar to those implemented on IPI controllers, but the buffer and optimization occur in each drive rather than in the controller for a string of disks. TCQ is implemented in Solaris 2 only and is supported on the disks listed in Table 8-6. Sun spent a lot of time debugging the TCQ firmware in the disks when it was first used. Since few other vendors used this feature at first, it may be wise to disable TCQ if old, third-party disks are configured on a system. Some old, third-party drives, when probed, indicate that they support TCQ but fail when it is used. This may be an issue when upgrading from SunOS 4 (which never tries to use TCQ) to Solaris 2, but is not likely to be an issue on recent systems. In Solaris 2, TCQ is disabled by clearing a scsi_options bit, as described in the next section.

Table 8-6. Advanced SCSI Disk Specifications
Disk Type               Tagged Commands   On-board Cache
Sun 424 3.5”            None              64 KB
Sun 535 thin 3.5”       64                256 KB
Sun 1.05G 3.5”          64                256 KB
Sun 1.05G thin 3.5”     64                256 KB
Sun 2.1G 5.25”          16                256 KB
Sun 2.1G 3.5” 5400rpm   64                256 KB
Sun 2.9G 5.25”          64                512 KB
Most recent disks       64                512 – 2048 KB

Setting the SCSI Options Variable

A kernel variable called scsi_options globally enables or disables several SCSI features. To see the option values, look at the values defined in the file (for Solaris 2) /usr/include/sys/scsi/conf/autoconf.h. The default value for the kernel variable scsi_options is 0x3F8, which enables all options. To disable tagged command queuing, set scsi_options to 0x378. If wide SCSI disks are being used with Solaris 2.3, then /etc/system should have the command set scsi_options=0x3F8 added. The command is set by default in later releases. Table 8-7 lists the scsi_options values.

Table 8-7. SCSI Options Bit Definitions
#define SCSI_OPTIONS_DR       0x8     Global disconnect/reconnect
#define SCSI_OPTIONS_LINK     0x10    Global linked commands
#define SCSI_OPTIONS_SYNC     0x20    Global synchronous xfer capability
#define SCSI_OPTIONS_PARITY   0x40    Global parity support
#define SCSI_OPTIONS_TAG      0x80    Global tagged command support
#define SCSI_OPTIONS_FAST     0x100   Global FAST scsi support
#define SCSI_OPTIONS_WIDE     0x200   Global WIDE scsi support
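For example, to disable tagged command queuing while leaving the other default options enabled, clear just the tag bit from the default value (0x3F8 & ~0x80 = 0x378) and add the result to /etc/system:

set scsi_options=0x378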

Sun’s SCSI Controller Products

The original SPARCstation 1 and the VME-hosted SCSI controller used by the SPARCserver 470 do not support Synchronous SCSI. In the case of the SPARCstation 1, this nonsupport is due to SCSI bus noise problems that were solved for the SPARCstation 1+ and subsequent machines. If noise occurs during a Synchronous SCSI transfer, then a SCSI reset happens, and, although disks will retry, tapes will abort. In versions of SunOS before SunOS 4.1.1, the SPARCstation 1+, IPC, and SLC have Synchronous SCSI disabled as well. The VME SCSI controller supports a maximum of 1.2 Mbytes/s, whereas the original SBus SCSI supports 2.5 Mbytes/s because it shares its unbuffered DMA controller bandwidth with the Ethernet.

The SPARCstation 2, IPX, and ELC use a higher-performance, buffered SBus DMA chip than in the SPARCstation 1, 1+, SLC, and IPC, and they can drive sustained SCSI bus transfers at 5.0 Mbytes/s. The original SBus SCSI add-on cards and the built-in SCSI controller on the SPARCstation 330 and the original SPARCserver 600 series are also capable of this speed. The SPARCserver 600 spawned the first combined SCSI/Buffered Ethernet card, the SBE/S; one is integrated into the CPU board.

The SPARCstation 10 introduced the first fast SCSI implementation from Sun, together with a combined fast SCSI/Buffered Ethernet SBus card, the FSBE/S. The SPARCserver 600 series was upgraded to have the fast SCSI controller built in at this time.

A differential version of the fast SCSI controller with buffered Ethernet, the DSBE/S, was then introduced to replace the IPI controller in the high-end, rack-based systems. The DSBE/S is used together with differential fast SCSI drives, which come in rack-mount packages. All the above SCSI controllers are relatively simple DMA bus master devices that are all supported by the esp driver in Solaris 2.

The replacement for the DSBE/S is the DWI/S, which is a differential, wide SCSI interface with no Ethernet added. The wide SCSI interface runs at twice the speed and can support twice as many SCSI target addresses, so the controller can connect to two trays of disks. The DWI/S is a much more intelligent SCSI controller and has a new isp device driver. The isp driver is much more efficient than the esp driver and uses fewer interrupts and less CPU power to do an I/O operation. More recently, the UltraSPARC-based generation of systems mostly uses an interface, called fas, that lies between the esp and isp in its features. It supports fast SCSI but is less efficient and costs less than the isp controller. The latest PCI bus-based version of fas includes Ultra SCSI support.

Table 8-8 summarizes the specifications. Real throughput is somewhat less than the speed quoted, in the range 27,000-33,000 Kbytes/s for a “40 MB/s” Ultra SCSI connection.

Table 8-8. SCSI Controller Specifications
Controller                    Bus Interface         Speed          Type
Sun SCSI-II                   VME                   1.2 MB/s       Asynchronous
SPARCserver 330               Built-in              5.0 MB/s       Synchronous
SPARCstation 1                Built-in SBus         2.5 MB/s       Asynchronous
SPARCstation 1+               Built-in SBus         2.5 MB/s       Synchronous
SPARCstation IPC
SPARCstation SLC
SPARCstation 2                Built-in SBus         5.0 MB/s       Synchronous
SPARCstation ELC
SPARCstation IPX
SBus SCSI X1055               SBus Add-on           5.0 MB/s       Synchronous
Early SPARCserver 600         Built-in SBus         5.0 MB/s       Synchronous
SBE/S X1054                   SBus Add-on           5.0 MB/s       Synchronous
SPARCstation 10               Built-in SBus (esp)   10.0 MB/s      Fast
SPARCclassic
SPARCstation LX
SPARCstation 5
SPARCstation 20
Ultra 1/140, 170
Late model SPARCserver 600    Built-in SBus (esp)   10.0 MB/s      Fast
Ultra 1/140E, 170E, 200E      Built-in SBus (fas)   20.0 MB/s      Fast and wide
Ultra 2
E3000, E4000, E5000, E6000
Ultra 30                      PCI Bus               40 MB/s        Ultra-SCSI
E450                                                               20 MHz wide
FSBE/S X1053                  SBus Add-on           10.0 MB/s      Fast
DSBE/S X1052                  SBus Add-on           10.0 MB/s      Differential fast
SPARCserver 1000              Built-in SBus         10.0 MB/s      Fast
DWI/S X1062                   SBus Add-on           20.0 MB/s      Differential wide
RSM2000 UltraSCSI             SBus Add-on           40.0 MB/s      Differential
SPARCstorage Array SOC        SBus Add-on           25+25 MB/s     2 x Fiber Channel
SPARCstorage Array Internal   Built-in SBus (isp)   20.0 MB/s      Wide
Array5000 SOC+                SBus Add-on           100+100 MB/s   Fiber Channel Arbitrated Loop

The SPARCstorage Disk Array

The first fiber-based interface from Sun was the SPARCstorage Disk Array. The architecture of this subsystem is shown in Figure 8-9. The subsystem connects to a system by means of the SCSI protocol over a fiber channel link. A single SBus Fiber Channel card (the SBus Optical Channel) supports two separate connections, so that two disk arrays can be connected to a single SBus slot.

Figure 8-9. SPARCstorage Disk Array Architecture


Several components make up the product in several combinations, as summarized in Table 8-9. The product name is a three-digit code such as 102 or 214.

  • The first digit indicates the physical package. The 100 package is a compact unit the same size as a SPARCserver 1000; the unit takes thirty 1-inch high 3.5-inch drives that are hot pluggable in trays of 10. The 200 package is a rack-mounted cabinet that uses six of the old-style differential SCSI trays with six 5.25-inch disks—36 in total. The 210RSM package uses six of the RSM trays, which take seven individually hot-pluggable 3.5-inch disks each—42 in total.

  • The middle digit in the product code indicates the controller. The original controller with 4 Mbytes of NVRAM is used in the 100 and 200 packages. The updated controller with 16 Mbytes of NVRAM is used in the 110 and 210 packages. The early models were just called the Model 100 and Model 200.

  • For later models, the third digit indicates the size of the disk in gigabytes. The model 100 was mainly shipped with 1.05 Gbyte and 2.1 Gbyte disks as models 100 and 112. The original model 200 can be connected to existing trays of 2.9 Gbyte disks, but is most often used with trays of six 9.1 Gbyte disks, for a total of 648 Gbytes per SBus slot in high-capacity installations. The RSM trays are normally used with 4.2-Gbyte or 9-Gbyte disks for a total of 756 Gbytes per SBus slot.

Table 8-9. Common SPARCstorage Array Configurations
Product Name    Controller        Disk               Total
Model 100       4 Mbytes NVRAM    30 x 1.05 Gbytes   31 Gbytes
Model 112       16 Mbytes NVRAM   30 x 2.1 Gbytes    63 Gbytes
Model 114       16 Mbytes NVRAM   30 x 4.2 Gbytes    126 Gbytes
Model 200       4 Mbytes NVRAM    36 x 9.1 Gbytes    328 Gbytes
Model 214RSM    16 Mbytes NVRAM   42 x 4.2 Gbytes    176 Gbytes
Model 219RSM    16 Mbytes NVRAM   42 x 9.1 Gbytes    382 Gbytes

Each fiber channel interface contains two fibers that carry signals in opposite directions at up to 25 Mbytes/s. This is unlike normal SCSI, where the same set of wires is used in both directions, and it allows concurrent transfer of data in both directions. Within the SPARCstorage Disk Array, a microSPARC processor connects to a dual fiber channel interface, allowing two systems to share a single array for high-availability configurations. A 4- or 16-Mbyte nonvolatile RAM buffer stores data that is on its way to disk. Since the NVRAM buffer has a battery backup, writes can safely be acknowledged as soon as they are received, before the data is actually written to disk. The microSPARC processor also controls six separate Fast and Wide SCSI channels, each connected to five, six, or seven disks at 20 Mbytes/s. The combination of software and hardware supports any mixture of mirroring, disk striping, and RAID5 across the drives. The SPARCstorage Array controller can sustain over 2,000 I/O operations per second (IOPS).

The fiber channel interface is faster than most SCSI controller interfaces and supports simultaneous transfer in both directions, unlike a conventional SCSI bus. The NVRAM store provides a very fast turnaround for writes. Read latency is not quite as good as a directly connected wide SCSI disk, and the ultimate throughput for sequential access is limited by the fiber capacity. These characteristics are ideally suited to general-purpose NFS service and on-line transaction processing (OLTP) workloads such as TPC-C, which are dominated by small random accesses and require fast writes for good interactive performance. The Veritas Volume Manager-derived SPARCstorage Manager software provided with the array has a GUI that helps keep track of all the disks, monitors performance, and makes it easy to allocate space within the array. Figure 8-10 and Figure 8-11 illustrate the graphical user interface.

Figure 8-10. SPARCstorage Array Manager Configuration Display


Figure 8-11. SPARCstorage Array Monitoring Display Screen Shot


The RSM2000, A1000 and A3000 Hardware RAID Subsystems

The RSM2000 was renamed as the A3000 in early 1998. The A1000 is a small RAID subsystem using a single controller, while the A3000 uses dual controllers with higher capacity packaging. The controller is based on a hardware RAID controller design from Symbios. Sun worked with Symbios to upgrade the product, packaged it into Sun disk subsystems, and developed firmware and device drivers for Solaris. The A3000 design is fully redundant, with no single point of failure and very high performance in RAID5 mode. It has two Ultra SCSI connections to the host system and can sustain 66 Mbytes/s. In the event of a failure, it can continue to work with a single Ultra SCSI connection. This mechanism is handled by means of a special rdac device driver that makes the two Ultra SCSI buses look like a single connection. A 100-Mbyte/s fiber channel connection in place of each Ultra SCSI to the host is an obvious upgrade to this product. The A1000, in a smaller package with a single controller, provides an entry-level, high-performance RAID5 solution. Figure 8-12 illustrates the architecture. The RSM disk tray is superseded by a new package that includes an internal controller in the A1000, and direct connection to an external pair of controllers in the A3000.

Figure 8-12. RSM2000/A3000 Disk Array Architecture


The Enterprise Network Array 5000: A5000 FC-AL Subsystem

The A5000 is a building block, used to construct a fiber-based storage fabric, that is very well suited to access from clusters of systems. The package is fully redundant and hot pluggable, with dynamic multipath connections, dual power supplies, and so on. The FC-AL loop goes right through the drives with no separate controller. The high bandwidth and lack of contention at the drive level allow each disk to support a 5%-10% higher random access rate—perhaps 115 IOPS rather than 105 on a comparable 7200 rpm disk. It also provides the lowest latency for reads and the highest bandwidth for streaming sequential reads and writes. Lack of NVRAM in the basic product can be compensated for by use of dedicated log disks to accelerate writes. It is best used in a mirrored configuration, inasmuch as there is no NVRAM for RAID5 acceleration in the initial product. Figure 8-13 illustrates the architecture.

Figure 8-13. 56” Rack-Mounted A5000 Disk Array Architecture


There are two packaging options: The desktop package contains 14 disks with twin loop interfaces; the rack-mounted server option contains four of the desktop packages combined with two 7-port fiber channel switches, so that no single point of failure can prevent access to all the disks. The dual loops from each package of 14 disks go to separate switches. The host interface has four loops, and fully redundant, high-bandwidth connections can be made between loops of hosts and disks in clustered configurations. When configured with 9.1-Gbyte disks, a 56-disk rack adds up to 510 Gbytes of disk with sustainable sequential throughput of about 360 Mbytes/s. These characteristics are ideal for data warehouse and high-performance computing applications in particular.

The A7000 High-End Disk Subsystem

In late 1997, Sun acquired the storage business of Encore corporation, and the Encore SP-40 product was renamed the Sun A7000 as the high end of Sun’s new StorEdge product line. The A7000 is equally at home connected over an ESCON channel to a mainframe system or connected to Unix systems over SCSI. Uniquely, the same on-disk data can be accessed by both types of system concurrently. This avoids any copying overhead if mainframe data is being accessed for data warehousing applications on Unix systems. The A7000 also supports wide-area remote data copy at the storage subsystem level, which can be used to implement distributed, disaster-resilient configurations. Unlike other Sun disk subsystems, the A7000 is configured to order by Sun personnel, contains “phone home” modem connections for automated remote diagnostics, and is designed with flexibility and availability as its primary characteristics. With up to 4 Gbytes of mirrored cache memory and over 1 terabyte per unit, the A7000 is a direct competitor to the EMC Symmetrix product line. It is also the first Sun disk subsystem supported on non-Sun servers (mainframe, Unix, and NT), although it is planned for the rest of Sun’s disk subsystem products to be supported on HP-UX and NT platforms.