[Cockcroft98] Chapter 7. Applications

This chapter discusses the ways in which a user running an application on a Sun machine can control or monitor it on a program-by-program basis.

Tools for Applications

When you don’t have the source code for an application, you must use special tools to figure out what the application is really doing.

Tracing Applications

Applications make frequent calls into the operating system, both to shared libraries and to the kernel via system calls. System call tracing has been a feature of Solaris for a long time, and in Solaris 2.6 a new capability allows tracing and profiling of the shared library interface as well.

Tracing System Calls With truss

The Solaris 2 truss command has many features not found in the original SunOS 4 trace command. It can trace child processes, and it can count and time system calls and signals. Other options allow named system calls to be excluded or focused on, and data structures can be printed out in full. Here is an excerpt showing a fragment of truss output with the -v option to set verbose mode for data structures, and an example of truss -c showing the system call counts.

% truss -v all cp NewDocument Tuning 
execve("/usr/bin/cp", 0xEFFFFB28, 0xEFFFFB38) argc = 3
open("/usr/lib/libintl.so.1", O_RDONLY, 035737561304) = 3
mmap(0x00000000, 4096, PROT_READ, MAP_SHARED, 3, 0) = 0xEF7B0000
fstat(3, 0xEFFFF768)= 0
d=0x0080001E i=29585 m=0100755 l=1 u=2 g=2 sz=14512
at = Apr 27 11:30:14 PDT 1993 [ 735935414 ]
mt = Mar 12 18:35:36 PST 1993 [ 731990136 ]
ct = Mar 29 11:49:11 PST 1993 [ 733434551 ]
bsz=8192 blks=30 fs=ufs
....
% truss -c cp NewDocument Tuning
syscall      seconds  calls  errors
_exit            .00      1
write            .00      1
open             .00     10       4
close            .01      7
creat            .01      1
chmod            .01      1
stat             .02      2       1
lseek            .00      1
fstat            .00      4
execve           .00      1
mmap             .01     18
munmap           .00      9
memcntl          .01      1
                ----    ---     ---
sys totals:      .07     57       5
usr time:        .02
elapsed:         .43


An especially powerful technique is to log all the file open, close, directory lookup, read, and write calls to a file, then figure out what parts of the system the application is accessing. A trivial example is shown in Figure 7-1.

Figure 7-1. Example Using truss to Track Process File Usage
% truss -o /tmp/ls.truss -topen,close,read,write,getdents ls / >/dev/null 
% more /tmp/ls.truss
open("/dev/zero", O_RDONLY) = 3
open("/usr/lib/libw.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libintl.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libc.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/lib/libdl.so.1", O_RDONLY) = 4
close(4) = 0
open("/usr/platform/SUNW,Ultra-2/lib/libc_psr.so.1", O_RDONLY) = 4
close(4) = 0
close(3) = 0
open("/", O_RDONLY|O_NDELAY) = 3
getdents(3, 0x0002D110, 1048) = 888
getdents(3, 0x0002D110, 1048) = 0
close(3) = 0
write(1, " T T _ D B", 5) = 5
write(1, "\n b i n\n c d r o m\n c".., 251) = 251

Tracing Shared Library Calls With sotruss

The dynamic linker has many new features. Read the ld(1) manual page for details. Two features that help with performance tuning are tracing and profiling. The Solaris 2.6 and later sotruss command is similar in use to the truss command and can be told which calls you are interested in monitoring. Library calls are, however, much more frequent than system calls and can easily generate too much output.
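As a minimal sketch, sotruss is simply run in front of the command to be traced and prints one line for each call that crosses a shared library boundary; the file names below are carried over from the earlier truss example, and the options for restricting tracing to particular libraries are described in the sotruss manual page.

% sotruss cp NewDocument Tuning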

Profiling Shared Libraries with LD_PROFILE

The LD_PROFILE profiling option was new in Solaris 2.5. It allows the usage of a shared library to be recorded and accumulated from multiple commands. This data has been used to tune window system libraries and libc.so, which is used by every command in the system. Profiling is enabled by setting the LD_PROFILE environment variable to the name of the library you wish to profile. By default, profile data accumulates in /var/tmp, but LD_PROFILE_OUTPUT, if set, can be used to specify an alternative directory. As for a normal profile, gprof is used to process the data. Unlike the case for a normal profile, no special compiler options are needed, and it can be used on any program.
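Here is a sketch of the sequence for csh users, profiling libc; the profile file name (the library name with a .profile suffix under /var/tmp) is an assumption based on the linker documentation, so check what actually appears in the output directory.

% setenv LD_PROFILE libc.so.1
% ls /etc > /dev/null
% man madvise > /dev/null
% unsetenv LD_PROFILE
% gprof /usr/lib/libc.so.1 /var/tmp/libc.so.1.profile | more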

For all these utilities, security is maintained by limiting their use on set-uid programs to the root user and searching for libraries only in the standard directories.

Timing

The C shell has a built-in time command that is used during benchmarking or tuning to see how a particular process is running. In Solaris 2, the shell does not compute all this data, so the last six values are always zero.

% time man madvise 
...
0.1u 0.5s 0:03 21% 0+0k 0+0io 0pf+0w
%

In this case, 0.1 seconds of user CPU and 0.5 seconds of system CPU were used in 3 seconds elapsed time, which accounted for 21% of the CPU. Solaris 2 has a timex command that uses system accounting records to summarize process activity, but the command works only if accounting is enabled. See the manual pages for more details.
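As a sketch, timex is invoked just like time; the figures below are only illustrative, and it is the -o and -p options, which report per-process accounting data, that depend on accounting being enabled.

% timex man madvise

real        2.05
user        0.08
sys         0.43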

Process Monitoring Tools

Processes are monitored and controlled via the /proc interface. In addition to the familiar ps command, a set of example programs is described in the proc(1) manual page, including ptree, which prints out the process hierarchy, and ptime, which provides accurate and high-resolution process timing.

% /usr/proc/bin/ptime man madvise 
real 1.695
user 0.005
sys 0.009

Note the difference in user and system time. I first ran this command a very long time ago on one of the early SPARC machines and, using time, recorded the output. The measurement above using ptime was taken on a 300 MHz UltraSPARC, and by this measurement, the CPU resources used to view the manual page have decreased by a factor of 43, from 0.6 seconds to 0.014 seconds. If you use the csh built-in time command, you get zero CPU usage for this measurement on the fast system because there is not enough resolution to see anything under 0.1 seconds.

The ptime command uses microstate accounting to get the high-resolution measurement. It obtains, but fails to print out, many other useful measurements. Naturally, this can be fixed by writing a script in SE to get the missing data and show it all. The script is described in “msacct.se” on page 485, and it shows that many process states are being measured. Microstate accounting itself is described in “Network Protocol (MIB) Statistics via Streams” on page 403. The msacct.se command is given a process ID to monitor and produces the output shown in Figure 7-2.

Figure 7-2. Example Display from msacct.se
% se msacct.se 354 
Elapsed time 3:29:26.344 Current time Tue May 23 01:54:57 1995
User CPU time 5.003 System call time 1.170
System trap time 0.004 Text pfault sleep 0.245
Data pfault sleep 0.000 Kernel pfault sleep 0.000
User lock sleep 0.000 Other sleep time 9:09.717
Wait for CPU time 1.596 Stopped time 0.000

The Effect of Underlying Filesystem Type

Some programs are predominantly I/O intensive or may open and close many temporary files. SunOS has a wide range of filesystem types, and the directory used by the program could be placed onto one of the following types.

Unix File System (UFS)

The standard file system on disk drives is the Unix File System, which in SunOS 4.1 and on is the Berkeley Fat Fast File system. Files that are read stay in RAM until a RAM shortage reuses the pages for something else. Files that are written are sent out to disk as described in “Disk Writes and the UFS Write Throttle” on page 172, but the file stays in RAM until the pages are reused for something else. There is no special buffer cache allocation, unlike other Berkeley-derived versions of Unix. SunOS 4 and SVR4 both use the whole of memory to cache pages of code, data, or I/O. The more RAM there is, the better the effective I/O throughput is.

UFS with Transaction Logging

The combination of Solaris 2.4 and Online: DiskSuite™ 3.0 or later releases supports a new option to standard UFS. Synchronous writes and directory updates are written sequentially to a transaction log that can be on a different device. The effect is similar to the Prestoserve nonvolatile RAM cache, but the transaction log device can be shared with another system in a dual-host, failover configuration. The filesystem check with fsck requires that only the log is read, so very large file systems are checked in a few seconds.
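As a rough sketch only: with the DiskSuite products, logging is set up by building a trans metadevice from a master (UFS) device and a separate log device, then mounting the trans device. The metadevice names and mount point below are hypothetical; check the metainit manual page for the exact syntax on your release.

# metainit d20 -t d10 d11
# mount /dev/md/dsk/d20 /export/data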

Tmpfs

Tmpfs is a RAM disk filesystem type. Files that are written are never put out to disk as long as some RAM is available to keep them in memory. If there is a RAM shortage, then the pages are stored in the swap space. The most common way to use this filesystem type in SunOS 4.X is to uncomment the line in /etc/rc.local for mount /tmp. The /tmp directory is accelerated with tmpfs by default in Solaris 2.

One side effect of this feature is that the free swap space can be seen by means of df. The tmpfs file system limits itself to prevent using up all the swap space on a system.

% df /tmp 
Filesystem            kbytes    used   avail capacity  Mounted on
swap                   15044     808   14236     5%    /tmp

The NFS Distributed Computing File System

NFS is a networked file system coming from a disk on a remote machine. It tends to have reasonable read performance but can be poor for writes and is slow for file locking. Some programs that do a lot of locking run very slowly on NFS-mounted file systems.

Cachefs

New since Solaris 2.3 is the cachefs filesystem type. It uses a fast file system to overlay accesses to a slower file system. The most useful way to use cachefs is to mount mostly read-only NFS file systems via a local UFS disk cache. The first time a file is accessed, blocks of it are copied to the local UFS disk. Subsequent accesses check the NFS attributes to see if the file has changed, and if not, the local disk is used. Any writes to the cachefs file system are written through to the underlying files by default, although there are several options that can be used in special cases for better performance. Another good use for cachefs is to speed up accesses to slow devices like magneto-optical disks and CD-ROMs.

When there is a central server that holds application binaries, these binaries can be cached on demand at client workstations. This practice reduces the server and network load and improves response times. See the cfsadmin manual page for more details. Solaris 2.5 includes the cachefsstat(1M) command to report cache hit rate measures.
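A sketch of caching a read-mostly NFS application file system on a local UFS disk follows; the server name, export path, cache directory, and mount point are all hypothetical.

# cfsadmin -c /cache/apps
# mount -F cachefs -o backfstype=nfs,cachedir=/cache/apps server1:/export/apps /usr/apps
# cachefsstat /usr/apps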

Caution



Cachefs should not be used to cache shared, NFS-mounted mail directories and can slow down access to write-intensive home directories.


Veritas VxFS File System

Veritas provides the VxFS file system for resale by Sun and other vendors. Compared to UFS, it has several useful features. UFS itself has some features (like disk quotas) that were not in early releases of VxFS, but VxFS is now a complete superset of all the functions of UFS.

VxFS is an extent-based file system, which is completely different from UFS, an indirect, block-based file system. The difference is most noticeable for large files. An indirect block-based file system breaks the file into 8-Kbyte blocks that can be spread all over the disk. Additional 8-Kbyte indirect blocks keep track of the location of the data blocks. For files of over a few Mbytes, double indirect blocks are needed to keep track of the location of the indirect blocks. If you try to read a UFS file sequentially at high speed, the system has to keep seeking to pick up indirect blocks and scattered data blocks. This seek procedure limits the maximum sequential read rate to about 30 Mbytes/s, even with the fastest CPU and disk performance.

In contrast, VxFS keeps track of data by using extents. Each extent contains a starting point and a size. If a 2-Gbyte file is written to an empty disk, it can be allocated as a single 2-Gbyte extent. There are no indirect blocks, and a sequential read of the file reads an extent record, then reads data for the complete extent. In November 1996, Sun published a benchmark result, using VxFS on an E6000 system, where a single file was read at a sustained rate of about 1 Gbyte/s. The downside of large extents is that they fragment the disk. After lots of files have been created and deleted, it could be difficult to allocate a large file efficiently. Veritas provides tools with VxFS that defragment the data in a file system by moving and merging extents.

The second advanced feature provided is snapshot backup. If you want a consistent online backup of a file system without stopping applications that are writing new data, you tell VxFS to snapshot the state of the file system at that point. Any new data or deletions are handled separately. You can back up the snapshot, then free it when the backup is done, recovering the extra disk space used by the snapshot.

Direct I/O Access

In some cases, applications that access very large files do not want them buffered by the normal filesystem code. They can run better with raw disk access. Raw access can be administratively inconvenient because the raw disk partitions are hard to keep track of. The simplest fix for this situation is to use a volume management GUI such as the Veritas VxVM to label and keep track of raw disk space. A need for many small raw files could still be inconvenient, so options are provided for direct I/O access: unbuffered access to a file in a normal file system. The VxFS extent is closer in its on-disk layout to raw, so the direct I/O option is reasonably fast. A limitation is that VxFS direct I/O can be used only for block-aligned reads and writes. The VxFS file system can be used with Solaris 2.5.1.

UFS directio is a new feature in Solaris 2.6; see mount_ufs(1M). UFS still suffers from indirect blocks and fragmented data placement, so directio access is less efficient than raw. A useful feature is that directio automatically reverts to buffered access for any access that is not aligned to an 8-Kbyte disk block, allowing a mixture of direct and buffered accesses to the same file.
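For example, a UFS file system can be mounted with direct I/O enabled by means of the forcedirectio mount option in Solaris 2.6; the device and mount point here are hypothetical.

# mount -F ufs -o forcedirectio /dev/dsk/c1t2d0s6 /export/bigfiles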

Customizing the Execution Environment

The execution environment is largely controlled by the shell. There is a command that can be used to constrain a program that is hogging too many resources. For csh the command is limit; for sh and ksh the command is ulimit. A default set of Solaris 2 resource limits is shown in Table 7-1.

Users can increase limits up to the hard system limit. The superuser can set higher limits. The limits on data size and stack size are 2 Gbytes on recent machines with the SPARC Reference MMU but are limited to 512 Mbytes and 256 Mbytes respectively by the sun4c MMU used in the SPARCstation 1 and 2 families of machines.

Table 7-1. Resource Limits
Resource Name          Soft User Limit          Hard System Limit
cputime                unlimited                unlimited
filesize               unlimited                unlimited
datasize               524280–2097148 Kbytes    524280–2097148 Kbytes
stacksize              8192 Kbytes              261120–2097148 Kbytes
coredumpsize           unlimited                unlimited
descriptors            64                       1024
memorysize (virtual)   unlimited                unlimited

In Solaris 2.6, you can increase the datasize to almost 4 Gbytes. The memorysize parameter limits the size of the virtual address space, not the real usage of RAM, and can be useful to prevent programs that leak memory from consuming all the swap space.

Useful changes to the defaults are those made to prevent core dumps from happening when they aren’t wanted:

% limit coredumpsize 0

To run programs that use vast amounts of stack space:

% limit stacksize unlimited

File Descriptor Limits

To run programs that need to open more than 64 files at a time, you must increase the file descriptor limit. The safest way to run such a program is to start it from a script that sets the soft user limit higher or to use the setrlimit call to increase the limit in the code:

% limit descriptors 256
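For sh and ksh users the equivalent is ulimit -n (this works in the Korn shell; the old Bourne shell ulimit handles only file size). A minimal wrapper script, with a hypothetical program name, might look like this:

#!/bin/ksh
ulimit -n 256
exec dbserver "$@"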

The maximum number of descriptors in SunOS 4.X is 256. This maximum was increased to 1024 in Solaris 2, although the standard I/O package still handles only 256. The definition of FILE in /usr/include/stdio.h has only a single byte to record the underlying file descriptor index. This data structure is so embedded in the code base that it cannot be increased without breaking binary compatibility for existing applications. Raw file descriptors are used for socket-based programs, and they can use file descriptors above the stdio.h limit. Problems occur in mixed applications when stdio tries to open a file after sockets have consumed all the low-numbered descriptors. This situation can occur if the name service is invoked late in a program, as the nsswitch.conf file is read via stdio.

At higher levels, additional problems occur. The select(3C) system call uses a bitfield with 1024 bits to track which file descriptors are being selected. It cannot handle more than 1,024 file descriptors and cannot be extended without breaking binary compatibility. Some system library routines still use select, including some X Window System routines. The official solution is to use the underlying poll(2) system call instead. This call avoids the bitfield issue and can be used with many thousands of open files. It is a very bad idea to increase the default limits for a program unless you know that it is safe. Programs should increase limits themselves by using setrlimit. If the programs run as root, they can increase the hard limit as well, for example to implement daemons that need thousands of connections.

The only opportunity to increase these limits comes with the imminent 64-bit address space ABI. It marks a clean break with the past, so some of the underlying implementation limits in Solaris can be fixed at the same time as 64-bit address support is added. The implications are discussed in “When Does ‘64 Bits’ Mean More Performance?” on page 134.

Databases and Configurable Applications

Examples of configurable applications include relational databases, such as Oracle, Ingres, Informix, and Sybase, that have large numbers of configuration parameters and an SQL-based configuration language. Many CAD systems and geographical information systems also have sophisticated configuration and extension languages. This section concentrates on the Sun-specific database issues at a superficial level; the subject of database tuning is beyond the scope of this book.

Hire an Expert!

For serious tuning, you either need to read all the manuals cover-to-cover and attend training courses or hire an expert for the day. The black box mentality of using the system exactly the way it came off the tape, with all parameters set to default values, will get you going, but there is no point in tuning the rest of the system if it spends 90 percent of its time inside a poorly configured database. Experienced database consultants will have seen most problems before. They know what to look for and are likely to get quick results. Hire them, closely watch what they do, and learn as much as you can from them.

Basic Tuning Ideas

Several times I have discovered database installations that have not even started basic tuning, so some basic recommendations on the first things to try may be useful. They apply to most database systems in principle, but I will use Oracle as an example, as I have watched over the shoulders of a few Oracle consultants in my time.

Increasing Buffer Sizes

Oracle uses an area of shared memory to cache data from the database so that all Oracle processes can access the cache. In old releases, the cache defaults to about 400 Kbytes, but it can be increased to be bigger than the entire data set if needed. I recommend that you increase it to at least 20%, and perhaps as much as 50%, of the total RAM in a dedicated database server if you are using raw disk space to hold the database tables. There are ways of looking at the cache hit rate within Oracle, so increase the size until the hit rate stops improving or until the rest of the system starts showing signs of memory shortage. Avoiding unnecessary random disk I/O is one of the keys to database tuning.

Solaris 2 implements a feature called intimate shared memory by which the virtual address mappings are shared as well as the physical memory pages. ISM makes virtual memory operations and context switching more efficient when very large, shared memory areas are used. In Solaris 2, ISM is enabled by the application when it attaches to the shared memory region. Oracle 7 and Sybase System 10 and later releases both enable ISM automatically by setting the SHM_SHARE_MMU flag in the shmat(2) call. In Solaris 2.6 on UltraSPARC systems, the shared memory segment is mapped by use of large (4 Mbyte) pages of contiguous RAM rather than many more individual 8-Kbyte pages. This mapping scheme greatly reduces memory management unit overhead and saves on CPU system time.
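You can check the size of the shared memory segment that the database has created with ipcs; -m selects shared memory and -b reports segment sizes. The output below is only illustrative.

% ipcs -mb
IPC status from <running system> as of Mon Aug 18 14:12:33 1997
T         ID      KEY        MODE        OWNER    GROUP      SEGSZ
Shared Memory:
m        200   0xae340002 --rw-r-----   oracle      dba  268435456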

Using Raw Disk Rather Than File Systems

During installation, you should create several empty disk partitions or stripes, spread across as many different disks and controllers as possible (but avoiding slices zero and two). You can then change the raw devices to be owned by Oracle (do this by using the VxVM GUI if you created stripes) and, when installing Oracle, specify the raw devices rather than files in the file system to hold the system, redo logs, rollback, temp, index, and data table spaces.
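If you set up plain partitions rather than VxVM stripes, the ownership change is simply a chown and chgrp on the raw device nodes; the device names here are hypothetical.

# chown oracle /dev/rdsk/c1t0d0s4 /dev/rdsk/c2t0d0s4
# chgrp dba /dev/rdsk/c1t0d0s4 /dev/rdsk/c2t0d0s4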

File systems incur more CPU overhead than do raw devices and can be much slower for writes due to inode and indirect block updates. Two or three blocks in widely spaced parts of the disk must be written to maintain the file system, whereas only one block needs to be written on a raw partition. Oracle normally uses 2 Kbytes as its I/O size, and the file system uses 8 Kbytes, so each 2-Kbyte read is always rounded up to 8 Kbytes, and each 2-Kbyte write causes an 8-Kbyte read, 2-Kbyte insert, and 8-Kbyte write sequence. You can avoid this excess by configuring an 8-Kbyte basic block size for Oracle, but this solution wastes memory and increases the amount of I/O done while reading the small items that are most common in the database. The data will be held in the Oracle SGA as well as in the main memory filesystem cache, thus wasting RAM. Improvements in the range of 10%–25% or more in database performance and reductions in RAM requirements have been reported after a move from file systems to raw partitions. A synchronous write accelerator (see “Disk Write Caching” on page 173) should be used with databases to act as a database log file accelerator.

If you persist in wanting to run in the file system, three tricks may help you get back some of the performance. The first trick is to turn on the “sticky bit” for database files. This makes the inode updates for the file asynchronous and is completely safe because the file is preallocated to a fixed size. This trick is used by swap files; if you look at files created with mkfile by the root user, they always have the sticky bit set.

# chmod +t table 
# ls -l table
-rw------T 1 oracle dba 104857600 Nov 30 22:01 table

The second trick is to use the direct I/O option discussed in “Direct I/O Access” on page 161. This option at least avoids the memory double buffering overhead. The third trick is to configure the temporary tablespace to be raw; there is often a large amount of traffic to and from the temporary tablespace, and, by its nature, it doesn’t need to be backed up and it can be re-created whenever the database starts up.

Fast Raw Backups

You can back up small databases by copying the data from the raw partition to a file system. Often, it is important to have a short downtime for database backups, and a disk-to-disk transfer is much faster than a backup to tape. Compressing the data as it is copied can save on disk space but is very CPU intensive; I recommend compressing the data if you have a high-end multiprocessor machine. For example,

# dd if=/dev/rsd1d bs=56k | compress > /home/data/dump_rsd1d.Z

Balance the Load over All the Disks

The log files should be on a separate disk from the data. This separation is particularly important for databases that have a lot of update activity. It also helps to put indexes and temporary tablespace on their own disks or to split the database tables over as many disks as possible. The operating system disk is often lightly used, and on a very small two-disk system, I would put the log files on the system disk and put the rest on its own disk. To balance I/O over a larger number of disks, stripe them together by using Veritas VxVM, Solstice DiskSuite, or a hardware RAID controller. Also see “Disk Load Monitoring” on page 183.
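As a sketch, a Solstice DiskSuite four-way stripe could be built as shown below; the metadevice name, disk slices, and interlace size are all hypothetical, and Veritas VxVM or a hardware RAID controller can achieve the same result.

# metainit d30 1 4 c1t0d0s0 c2t0d0s0 c3t0d0s0 c4t0d0s0 -i 64k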

Which Disk Partition to Use

If you use the first partition on a disk as a raw Oracle partition, then you will lose the disk’s label. If you are lucky, you can recover the label by using the “search for backup labels” option of the format command, but you should put a file system, swap space, Solstice DiskSuite state database, or a small, unused partition at the start of the disk.

On modern disks, the first part of the disk is the fastest, so, for best performance, I recommend a tiny first partition followed by a database partition covering the first half of the disk. See “Zoned Bit Rate (ZBR) Disk Drives” on page 205 for more details and an explanation.

The Effect of Indices

When you look up an item in a database, your request must be matched against all the entries in a (potentially large) table. Without an index, a full table scan must be performed, and the database reads the entire table from disk in order to search every entry. If there is an index on the table, then the database looks up the request in the index and knows which entries in the table need to be read from disk. Some well-chosen indexes can dramatically reduce the amount of disk I/O and CPU time required to perform a query. Poorly designed or untuned databases are often underindexed. The problem with indexes is that when an indexed table is updated, the index must be updated as well, so peak database write performance can be reduced.

How to Configure for a Large Number of Users

One configuration scenario is for the users to interact with the database through an ASCII forms-based interface. The forms front end is usually created by means of high-level, application-builder techniques and in some cases can consume a large amount of CPU. This forms front end inputs and echoes characters one at a time from the user over a direct serial connection or via telnet from a terminal server. Output tends to be in large blocks of text. The operating system overhead of handling one character at a time over telnet is quite high, and when hundreds of users are connected to a single machine, the Unix kernel consumes a lot of CPU power moving these characters around one at a time. In Solaris 2.5, the telnet and rlogin processing was moved into the kernel by means of streams modules. The old implementation uses a pair of daemon processes, one for each direction of each connection; the in-kernel version still has a single daemon for handling the protocol, but data traffic does not flow through the daemon. This configuration has been tested with up to 3,000 direct connections. Higher numbers are normally configured by a transaction processing monitor, such as Tuxedo.

The most scalable form of client-server configuration is for each user to have a workstation or a PC running the forms-based application and generating SQL calls directly to the backend server. Even more users can be connected this way because they do not log in to the database server and only a socket connection is made.

Database Tuning Summary

When you are tuning databases, it is useful to realize that in many cases the sizing rules that have been developed by database software vendors in the past do not scale well to today’s systems. In the mainframe and minicomputer worlds, disk I/O capacity is large, processors are slow, and RAM is expensive. With today’s systems, the disk I/O capacity, CPU power, and typical RAM sizes are all huge, but the latency for a single disk read is still very slow in comparison. It is worth trading off a little extra CPU overhead and extra RAM usage for a reduction in I/O requirements, so don’t be afraid to experiment with database buffer sizes that are much larger than those recommended in the database vendors’ documentation.

The next chapter examines the reasons why disk I/O is often the problem.