[Cockcroft98] Chapter 13. RAM and Virtual Memory

Chapter 13. RAM and Virtual Memory

The paging algorithm manages and allocates physical memory pages to hold the active parts of the virtual address space of processes running on the system. First, the monitoring statistics are explained; then, the algorithms for paging and swap are explained. The most important reason to understand paging is that you can then tell when adding more RAM will make a difference to performance.

Memory Usage and Sizing Tools

There are four primary consumers of memory: the kernel, processes, filesystem caches, and System V shared memory. Kernel memory usage is described in “Kernel Memory Allocation” on page 365. Process memory usage is described in “Process Memory Usage Information” on page 421. Filesystem caches invisibly consume large amounts of memory; there are no measurements of this usage in Solaris 2. System V shared memory can be observed by means of the ipcs command.

The Solaris Memory System is a white paper available via http://www.sun.com/sun-on-net/performance.html, written by my colleague Richard McDougall. It explains how to measure and size memory usage. Richard has also written a loadable kernel module that instruments memory usage by processes and the file system in detail and a set of tools that view this data. The memtool package has also been known internally at Sun as the bunyip[1] tool for the last few years. The tools can be obtained by sending email to memtool-request@chessie.eng.sun.com. The loadable kernel module is unsupported and should not be used in production environments, although it is suitable for use in benchmarks and sizing exercises. There is a significant risk that in the future the module could interact with updated kernel patches and cause a machine to crash. At present (early 1998), modules that work reliably with current patch levels of Solaris 2.5.1 and Solaris 2.6 are provided.

[1] A bunyip is a kind of Australian Loch Ness Monster that lives in a lake in the outback.

The process memory usage command /usr/proc/bin/pmap -x is provided only in Solaris 2.6. It is based on code written by Richard, and his memtool package includes this functionality for Solaris 2.5.1. Filesystem cache memory usage is instrumented in detail by memtool. When the process and file information is cross-referenced, it is possible to see how much of each executable file and shared library is actually in memory, as well as how much of that memory is mapped to each individual process. You can also see all the space taken up by data files and watch the page scanner reclaiming idle memory from files that are not being accessed often. These measurements consume significant amounts of CPU time, comparable to running a ps command, so should not be run continuously. They do provide a very valuable insight into memory usage, and work is in progress on extending the supported Solaris tools to add more information on memory usage in future releases.

Understanding vmstat and sar Output

The units used by the vmstat and sar commands are not consistent. Sometimes they use pages; sar tends to use blocks, where a block is 512 bytes, and vmstat tends to use Kbytes defined as 1024 bytes. Page sizes are not constant either. As described in “Memory Management Unit Designs” on page 266, the original SPARC systems used 8-Kbyte pages. Designs then switched to 4 Kbytes for a long time, the same size used by Intel x86 and other CPU designs. With UltraSPARC there was a return to 8 Kbytes. A larger page reduces MMU activity, halves the number of page faults required to load a given amount of data, and saves kernel memory by halving the size of tables used to keep track of pages. It also wastes RAM by rounding up small segments of memory to a larger minimum size, but with today’s typical memory sizes, this is a very minor problem. One problem is that collected data may not tell you the page size. If you collect a sar binary file on an UltraSPARC server and display it by using sar on a SPARCstation 5, the conversion of free swap space from pages to 512-byte blocks will be wrong. You can obtain the page size on each system by calling getpagesize(3C) or running the pagesize command.

% pagesize 
8192
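
The unit mismatch described above is easy to get wrong by hand, so here is a minimal sketch of the conversions in Python. The function names are illustrative only and are not part of any Solaris tool; the 512-byte block and 1024-byte Kbyte are the definitions given in the text.

```python
import os

BLOCK_BYTES = 512   # sar block size, as defined in the text
KBYTE = 1024        # vmstat Kbyte, as defined in the text


def pages_to_blocks(pages, page_size):
    """Convert a page count to 512-byte blocks for a given page size."""
    return pages * page_size // BLOCK_BYTES


def pages_to_kbytes(pages, page_size):
    """Convert a page count to Kbytes for a given page size."""
    return pages * page_size // KBYTE


def kbytes_to_pages(kbytes, page_size):
    """Convert Kbytes to whole pages for a given page size."""
    return kbytes * KBYTE // page_size


if __name__ == "__main__":
    # On a Unix system this is the value the pagesize command prints.
    local = os.sysconf("SC_PAGESIZE")
    for size in (4096, 8192, local):
        print(size, pages_to_blocks(1000, size), pages_to_kbytes(1000, size))
```

The same 1000 pages come out as 8000 blocks with 4-Kbyte pages and 16000 blocks with 8-Kbyte pages, which is the factor-of-two error you get when data collected on one machine is converted with the page size of another.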

Memory Measurements

The paging activity on a Sun can be monitored by means of vmstat or sar. sar is better for logging the information, but vmstat is more concise and crams more information into each line of text for interactive use. There is a simple correspondence between most of the values reported by sar and vmstat, as compared below. The details of some parts of vmstat and the underlying per-CPU kernel data are covered in “The Data Behind vmstat and mpstat Output” on page 229.

Swap Space: vmstat swap, sar -r freeswap, and swap -s

vmstat swap shows the available swap in Kbytes, sar -r freeswap shows the available swap in 512-byte blocks, and swap -s shows several measures including available swap. When available swap is exhausted, the system will not be able to use more memory. See “Swap Space” on page 339.

Free Memory: vmstat free and sar -r freemem

vmstat reports the free memory in Kbytes, that is, the pages of RAM that are immediately ready to be used whenever a process starts up or needs more memory. sar reports freemem in pages. The kernel variable freemem that they report is discussed later. The absolute value of this variable has no useful meaning. Its value relative to some other kernel thresholds is what is important.

Reclaims: vmstat re

Reclaims are the number of pages reclaimed from the free list. The page had been stolen from a process but had not yet been reused by a different process, so it can be reclaimed by the process, thus avoiding a full page fault that would need I/O. This procedure is described in “The Life Cycle of a Typical Physical Memory Page” on page 326.

Page Fault Measurements

The sar paging display is shown in Figure 13-1. It includes only counters for page-in operations and minor faults. The vmstat and sar measurements are compared below.

Figure 13-1. Example sar Paging Display
% sar -p 1 
09:05:56 atch/s pgin/s ppgin/s pflt/s vflt/s slock/s
09:05:57 0.00 0.00 0.00 7.92 3.96 0.00

Attaches to Existing Pages: vmstat at and sar -p atch

These commands measure the number of attaches to shared pages already in use by other processes; the reference count is incremented. The at data is shown only by the old SunOS 4 vmstat.

Pages Paged In: vmstat pi and sar -p pgin, ppgin

vmstat pi reports the number of Kbytes/s, and sar reports the number of page faults and the number of pages paged in by swap space or filesystem reads. Since the filesystem block size is 8 Kbytes, there may be two pages, or 8 Kbytes, paged in per page fault on machines with 4-Kbyte pages. UltraSPARC uses 8-Kbyte pages; almost all other CPU types use 4-Kbyte pages.

Minor Faults: vmstat mf and sar -p vflt

A minor fault is caused by an address space or hardware address translation fault that can be resolved without performing a page-in. It is fast and uses only a little CPU time. Thus, the process does not have to stop and wait, as long as a page is available for use from the free list.

Other Fault Types: sar -p pflt, slock, vmstat -s copy-on-write, zero fill

There are many other types of page faults. Protection faults are caused by illegal accesses of the kind that produce “segmentation violation - core dumped” messages. Copy-on-write and zero-fill faults are described in “The Life Cycle of a Typical Physical Memory Page” on page 326. Protection and copy-on-write together make up sar -p pflt. Faults caused by software locks held on pages are reported by sar -p slock. The total counts of several of these fault types are reported by vmstat -s.

Page-out Measurements

The sar page-out display is shown in Figure 13-2.

Figure 13-2. Example sar Page-out Display
% sar -g 1 
09:38:04 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
09:38:05 0.00 0.00 0.00 0.00 0.00

Pages Paged Out: vmstat po and sar -g pgout, ppgout

vmstat po reports the number of Kbytes/s, and sar reports the number of page-outs and the number of pages paged out to the swap space or file system. Because of the clustering that occurs on swap space writes, there may be very many pages written per page-out. “Swap Space Operations” on page 339 describes how this works.

Pages Freed: vmstat fr and sar -g pgfree

Pages freed is the rate at which memory is being put onto the free list by the page scanner daemon. vmstat fr is in Kbytes freed per second, and sar -g pgfree is in pages freed per second. Pages are usually freed by the page scanner daemon, but other mechanisms, such as processes exiting, also free pages.

The Short-Term Memory Deficit: vmstat de

Deficit is a paging parameter that provides some hysteresis for the page scanner when there is a period of high memory demand. If the value is non-zero, then memory was recently being consumed quickly, and extra free memory will be reclaimed in anticipation that it will be needed. This situation is described further in “Free List Performance Problems and Deficit” on page 332.

The Page Daemon Scanning Rate: vmstat sr and sar -g pgscan

These commands report the number of pages scanned by the page daemon as it looks for pages to steal from processes that aren’t using them often. This number is the key memory shortage indicator if it stays above 200 pages per second for long periods.

Pages Freed by the Inode Cache

ufs_ipf measures UFS inode cache reuse, which can cause pages to be freed when an inactive inode is freed. It is the number of inodes with pages freed (ipf) as a percentage of the total number of inodes freed.

Swapping Measurements

The sar swapping display is shown in Figure 13-3. It also includes the number of process context switches per second.

Figure 13-3. Example sar Swapping Display
% sar -w 1 
09:11:00 swpin/s bswin/s swpot/s bswot/s pswch/s
09:11:01 0.00 0.0 0.00 0.0 186

Pages Swapped In: vmstat -S si and sar -w swpin, bswin

vmstat -S si reports the number of Kbytes/s swapped in, sar -w swpin reports the number of swap-in operations, and sar -w bswin reports the number of 512-byte blocks swapped in.

Pages Swapped Out: vmstat -S so and sar -w swpot, bswot

vmstat -S so reports the number of Kbytes/s swapped out, sar -w swpot reports the number of swap-out operations, and sar -w bswot reports the number of 512-byte blocks swapped out.

Example vmstat Output Walk-through

To illustrate the dynamics of vmstat output, Figure 13-4 presents an annotated vmstat log taken at 5-second intervals. It was taken during an Empower RTE-driven 200-user test on a SPARCserver 1000 configured with Solaris 2.2, 128 Mbytes of RAM, and four CPUs. Emulated users logged in at 5-second intervals and ran an intense, student-style, software-development workload, that is, edit, compile, run, core dump, debug, look at man pages, and so forth. I have highlighted in bold type the numbers in the vmstat output that I take note of as the test progresses. Long sequences of output have been replaced by comments to reduce the log to a manageable length. As is often the case with Empower-driven workloads, the emulation of 200 dedicated users who never stop to think for more than a few seconds at a time provides a much higher load than 200 real-life students.

Figure 13-4. Example vmstat 5 Output for a SPARCserver 1000 with 200 Users
 procs      memory            page             disk           faults       cpu 
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
0 0 0 330252 80708 0 2 0 0 0 0 0 0 0 0 1 18 107 113 0 1 99
0 0 0 330252 80708 0 0 0 0 0 0 0 0 0 0 0 14 87 78 0 0 99
...users begin to log in to the system
4 0 0 320436 71448 0 349 7 0 0 0 0 2 1 0 12 144 4732 316 65 35 0
6 0 0 318820 69860 0 279 25 0 0 0 0 0 0 0 2 54 5055 253 66 34 0
7 0 0 317832 68972 0 275 3 0 0 0 0 1 0 0 1 48 4920 278 64 36 0
...lots of minor faults are caused by processes starting up
50 0 0 258716 14040 0 311 2 0 0 0 0 1 0 0 1 447 4822 306 59 41 0
51 0 0 256864 12620 0 266 2 0 0 0 0 3 1 0 12 543 3686 341 66 34 0
...at this point the free list drops below 8MB and the pager starts to scan
56 0 0 251620 8352 0 321 4 1 1 0 0 1 1 0 1 461 4837 342 57 43 0
60 0 0 238280 5340 5 596 1 371 1200 0 4804 0 0 0 6 472 3883 313 48 52 0
59 0 0 234624 10756 97 172 0 1527 1744 0 390 0 0 0 14 507 4582 233 59 41 0
60 0 0 233668 10660 9 297 2 0 0 0 0 4 2 0 12 539 5223 272 57 43 0
61 0 0 232232 8564 2 225 0 75 86 0 87 0 0 0 2 441 3697 217 71 29 0
62 0 0 231216 8248 2 334 11 500 547 0 258 1 0 0 7 484 5482 292 52 48 0
...some large processes exit, freeing up RAM and swap space
91 0 0 196868 7836 0 227 8 511 852 0 278 1 7 0 5 504 5278 298 50 50 0
91 1 0 196368 8184 1 158 3 1634 2095 0 652 0 37 0 5 674 3930 325 50 50 0
92 0 0 200932 14024 0 293 85 496 579 0 42 0 17 0 21 654 4416 435 47 53 0
93 0 0 208584 21768 1 329 9 0 0 0 0 0 0 0 3 459 3971 315 62 38 0
92 1 0 208388 20964 0 328 12 0 0 0 0 3 3 0 14 564 5079 376 53 47 0
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
...it was steady like this for a long time. RAM is OK, need more CPUs
189 0 0 41136 8816 3 99 32 243 276 0 168 1 1 0 9 500 3804 235 67 33 0
190 0 0 40328 8380 6 65 76 0 0 0 0 3 2 0 19 541 3666 178 71 29 0
190 0 0 40052 7976 1 56 102 58 65 0 32 0 1 0 15 457 3415 158 72 28 0
...users exit causing an I/O block as closing files flushes changes to disk
57 14 0 224600 55896 5 114 284 0 0 0 0 0 1 0 69 843 368 436 84 16 0
39 10 0 251456 61136 37 117 246 0 0 0 0 1 4 0 70 875 212 435 81 19 0
19 15 0 278080 65920 46 129 299 0 0 0 0 0 1 0 74 890 223 454 82 18 0
3 5 0 303768 70288 23 88 248 0 0 0 0 0 1 0 59 783 324 392 60 11 29
0 1 0 314012 71104 0 47 327 0 0 0 0 0 3 0 47 696 542 279 12 5 83
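
A log like Figure 13-4 can be checked mechanically for the scan rate rule of thumb given earlier (sr staying above 200 pages per second). The following is a minimal sketch in Python, not an existing tool. It assumes the column layout of the header line in the figure and looks fields up by position, because the name sy appears twice in that header. The function names and the choice of how many consecutive samples count as "sustained" are assumptions made for illustration.

```python
SCAN_LIMIT = 200  # pages/s threshold quoted in the text

HEADER = "r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id"

# Three consecutive data lines taken from Figure 13-4.
SAMPLE = [
    "60 0 0 238280 5340 5 596 1 371 1200 0 4804 0 0 0 6 472 3883 313 48 52 0",
    "59 0 0 234624 10756 97 172 0 1527 1744 0 390 0 0 0 14 507 4582 233 59 41 0",
    "60 0 0 233668 10660 9 297 2 0 0 0 0 4 2 0 12 539 5223 272 57 43 0",
]


def field(header, line, name):
    """Return one column of a vmstat data line as an integer.

    The first column with the given name is used, so "sy" means
    system calls rather than system CPU time.
    """
    return int(line.split()[header.split().index(name)])


def sustained_scanning(header, lines, limit=SCAN_LIMIT, min_samples=3):
    """True if sr exceeds limit for min_samples consecutive data lines."""
    run = 0
    for line in lines:
        if field(header, line, "sr") > limit:
            run += 1
            if run >= min_samples:
                return True
        else:
            run = 0
    return False


if __name__ == "__main__":
    print([field(HEADER, line, "sr") for line in SAMPLE])
    print(sustained_scanning(HEADER, SAMPLE, min_samples=2))
```

Comment lines such as "...users begin to log in to the system" and repeated header lines must be filtered out before the data lines are passed in.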


Virtual Memory Address Space Segments

All the RAM in a system is managed in terms of pages; these are used to hold the physical address space of a process. The kernel manages the virtual address space of a process by maintaining a complex set of data structures. Each process has a single address space structure, with any number of segment data structures to map each contiguous segment of memory to a backing device, which holds the pages when they are not in RAM. The segment keeps track of the pages that are currently in memory, and from this information, the system can produce the PMEG or PTE data structures that each kind of MMU reads directly to do the virtual-to-physical translations. The machine-independent routines that do this are called the hat (hardware address translation) layer. This architecture is generic to SunOS 4, SVR4, and Solaris 2. Figure 13-5 illustrates the concept. This diagram has been greatly simplified. In practice, there are many more segments than are shown here!

Figure 13-5. Simplified Virtual Memory Address Space and File Mappings


The code segment contains executable instructions, and the data segment contains pre-initialized data, such as constants and strings. BSS stands for Block Started by Symbol and holds uninitialized data that will be zero to begin with. In this diagram, the kernel is shown at the top of the address space. This placement is true for older systems, but UltraSPARC uses a completely separate address space for the kernel.

The Life Cycle of a Typical Physical Memory Page

This section provides additional insight into the way memory is used and makes the following sections easier to understand. The sequence described is an example of some common uses of pages; there are many other possibilities.

  1. Initialization — A Page Is Born

    When the system boots, all free memory is formed into pages, and a kernel data structure is allocated to hold the state of every page in the system.

  2. Free — An Untouched Virgin Page

    All the memory is put onto the free list to start with. At this stage, the content of the page is undefined.

  3. ZFOD — Joining a BSS Segment

    When a program accesses a BSS segment for the very first time, a minor page fault occurs and a Zero Fill On Demand (ZFOD) operation takes place. The page is taken from the free list, block-cleared to contain all zeroes, and added to the list of anonymous pages for the BSS segment. The program then reads and writes data to the page.

  4. Scanned — The pagedaemon Awakes

    When the free list gets below a certain size, the pagedaemon starts to look for memory pages to steal from processes. It looks at every page in physical memory order; when it gets to this page, the page is synchronized with the MMU and a reference bit is cleared.

  5. Waiting — Is the Program Really Using This Page Right Now?

    There is a delay that varies depending upon how quickly the pagedaemon scans through memory. If the program references the page during this period, the MMU reference bit is set.

  6. Page-out Time — Saving the Contents

    The pageout daemon returns and checks the MMU reference bit to find that the program has not used the page, so it can be stolen for reuse. The page is checked to see if anything had been written to it; since it does contain data in this case, a page-out occurs. The page is moved to the page-out queue and marked as I/O pending. The swapfs code clusters the page together with other pages on the queue and writes the cluster to the swap space. The page is then free and is put on the free list again. It remembers that it still contains the program data.

  7. Reclaim — Give Me Back My Page!

    Belatedly, the program tries to read the page and takes a page fault. If the page had been reused by someone else in the meantime, a major fault would occur and the data would be read from the swap space into a new page taken from the free list. In this case, the page is still waiting to be reused, so a minor fault occurs, and the page is moved back from the free list to the program’s BSS segment.

  8. Program Exit — Free Again

    The program finishes running and exits. The BSS segments are private to that particular instance of the program (unlike the shared code segments), so all the pages in the BSS segment are marked as undefined and put onto the free list. This is the same as Step 2.

  9. Page-in — A Shared Code Segment

    A page fault occurs in the code segment of a window system shared library. The page is taken off the free list, and a read from the file system is scheduled to get the code. The process that caused the page fault sleeps until the data arrives. The page is attached to the vnode of the file, and the segments reference the vnode.

  10. Attach — A Popular Page

    Another process using the same shared library page faults in the same place. It discovers that the page is already in memory and attaches to the page, increasing its vnode reference count by one.

  11. COW — Making a Private Copy

    If one of the processes sharing the page tries to write to it, a copy-on-write (COW) page fault occurs. Another page is grabbed from the free list, and a copy of the original is made. This new page becomes part of a privately mapped segment backed by anonymous storage (swap space) so it can be changed, but the original page is unchanged and can still be shared. Shared libraries contain jump tables in the code that are patched, using COW as part of the dynamic linking process.

  12. File Cache — Not Free

    The entire window system exits, and both processes go away. This time the page stays in use, attached to the vnode of the shared library file. The vnode is now inactive but will stay in its cache until it is reused, and the pages act as a file cache in case the user is about to restart the window system again. [2]

    [2] This subject was covered in “Vnodes, Inodes, and Rnodes” on page 360. The file system could be UFS or NFS.

  13. Fsflush — Flushed by the Sync

    Every 30 seconds, all the pages in the system are examined in physical page order to see which ones contain modified data and are attached to a vnode. Any modified pages will be written back to the file system, and the pages will be marked as clean.

This example sequence can continue from Step 4 or Step 9 with minor variations. The fsflush process occurs every 30 seconds by default for all pages, and whenever the free list size drops below a certain value, the pagedaemon scanner wakes up and reclaims some pages.
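
Steps 4 through 7 of the sequence can be sketched as a toy model in Python. This is an illustration of the reference-bit idea only, not the kernel's implementation. The Page class, its field names, and the three functions are hypothetical, and the model covers only the case where a freed page has not yet been reused, so it never produces a major fault.

```python
from dataclasses import dataclass


@dataclass
class Page:
    referenced: bool = True   # stands in for the MMU reference bit
    modified: bool = False    # page holds data not yet saved to swap
    state: str = "in_use"     # "in_use" or "free"


def scan_clear(pages):
    """Step 4: the scanner's first visit clears each reference bit."""
    for page in pages:
        if page.state == "in_use":
            page.referenced = False


def scan_steal(pages):
    """Step 6: the scanner returns and steals pages left unreferenced.

    Returns the number of modified pages that needed a page-out.
    """
    pageouts = 0
    for page in pages:
        if page.state == "in_use" and not page.referenced:
            if page.modified:
                pageouts += 1          # contents would be written to swap
                page.modified = False
            page.state = "free"        # on the free list, data still valid
    return pageouts


def touch(page, write=False):
    """Program access: 'hit' normally, 'reclaim' if taken off the free list."""
    result = "hit"
    if page.state == "free":
        page.state = "in_use"          # Step 7: minor fault, no disk I/O
        result = "reclaim"
    page.referenced = True
    if write:
        page.modified = True
    return result
```

A page touched between scan_clear and scan_steal (Step 5) keeps its place; a page left alone is stolen, and a later touch reclaims it.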

Free Memory—The Memory-Go-Round

Pages of physical memory circulate through the system via “free memory”; the concept seems to confuse a lot of people. This section explains the basic memory flows and describes the latest changes in the algorithm. The section also provides some guidance on when free memory may need tuning and what to change.

First, let’s look at some vmstat output again.

% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s5 -- in sy cs us sy id
0 0 0 480528 68056 0 3 5 2 2 0 0 65 12 9 0 165 968 101 2 2 95
0 0 0 476936 85768 0 15 107 0 0 0 0 3 4 7 0 465 1709 231 6 3 91
0 0 0 476832 85160 0 31 144 0 0 0 0 7 0 9 0 597 3558 367 8 6 87
0 0 0 476568 83840 0 7 168 0 0 0 0 4 1 6 0 320 796 155 6 1 93
0 0 0 476544 83368 0 0 28 0 0 0 0 1 2 0 0 172 1739 166 10 5 85


The first thing to remember is that vmstat prints out averaged rates, based on the difference between two snapshots of a number of kernel counters. The first line of output is printed immediately; hence, it is the difference between the current value of the counters and zero. This value works out to be the average since the system was booted. Second, if you run vmstat with a small time interval, you tend to catch all the short-term peaks and get results that are much more variable from one line to the next. I typically use five or ten seconds but may use one second to try and catch peaks or as much as 30 to 60 seconds to just “keep an eye” on a system. vmstat is a useful summary of what is happening on a system. The units of swap, free, pi, po, fr, and de are in Kbytes, and re, mf, and sr are in pages.
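
The snapshot arithmetic can be written down in a few lines. This is a minimal sketch in Python of the idea only; the counter names and values are made up for illustration and are not read from a real kernel.

```python
def rates(prev, curr, interval):
    """Per-second rates from two snapshots of cumulative counters."""
    return {name: (curr[name] - prev[name]) / interval for name in curr}


def since_boot(curr, uptime):
    """The first line of output: current counters minus zero, over uptime."""
    zero = {name: 0 for name in curr}
    return rates(zero, curr, uptime)


if __name__ == "__main__":
    # Hypothetical cumulative counters sampled 5 seconds apart.
    earlier = {"pgscan": 120000, "pgfault": 900000}
    later = {"pgscan": 121000, "pgfault": 901500}
    print(rates(earlier, later, 5))
    print(since_boot(later, 86400))
```

A burst of 1000 scanned pages shows as 200 pages/s over a 5-second interval but almost vanishes in the since-boot average, which is why the first line of output says little about current behavior.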

The Role of the Free List

Memory is one of the main resources in the system. There are four main consumers of memory, and when memory is needed, they all obtain it from the free list. Figure 13-6 shows these consumers and how they relate to the free list. When memory is needed, it is taken from the head of the free list. When memory is put back on the free list, there are two choices. If the page still contains valid data, it is put on the tail of the list so it will not be reused for as long as possible. If the page has no useful content, it is put at the head of the list for immediate reuse. The kernel keeps track of valid pages in the free list so that they can be reclaimed if their content is requested, thereby saving a disk I/O.
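
The head and tail policy can be sketched with a double-ended queue. This is a toy model in Python, not the kernel's data structure. The FreeList class, the integer page numbers, and the idea of an "identity" key naming the data a page still holds (for example a file and offset pair) are assumptions made for illustration.

```python
from collections import deque


class FreeList:
    """Toy free list: the left end of the deque is the head."""

    def __init__(self):
        self._pages = deque()
        self._page_of = {}      # identity of cached data -> page number
        self._identity_of = {}  # page number -> identity of cached data

    def free(self, page, identity=None):
        """Put a page back; identity=None means no useful content."""
        if identity is None:
            self._pages.appendleft(page)   # head: reuse immediately
        else:
            self._pages.append(page)       # tail: keep as long as possible
            self._page_of[identity] = page
            self._identity_of[page] = identity

    def allocate(self):
        """Take a page from the head; any old content is forgotten."""
        page = self._pages.popleft()
        identity = self._identity_of.pop(page, None)
        if identity is not None:
            del self._page_of[identity]
        return page

    def reclaim(self, identity):
        """Return the page still holding this data, or None if reused."""
        page = self._page_of.pop(identity, None)
        if page is not None:
            del self._identity_of[page]
            self._pages.remove(page)
        return page
```

A reclaim that finds its page corresponds to the re column in vmstat; one that returns None would have to be satisfied by a page-in from disk.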

Figure 13-6. The Memory-Go-Round


The vmstat reclaim counter is two-edged. On one hand, it is good that a page fault was serviced by a reclaim, rather than a page-in that would cause a disk read. On the other hand, you don’t want active pages to be stolen and end up on the free list in the first place. The vmstat free value is simply the size of the free list, in Kbytes. The way the size varies is what tends to confuse people. The most important value reported by vmstat is the scan rate, sr. If it is zero or close to zero, then you can be sure that the system does have sufficient memory. If it is always high (hundreds to thousands of pages/second), then adding more memory is likely to help.

Kernel Memory

At the start of the boot process, the kernel takes two or three megabytes of initial memory and puts all the rest on the free list. As the kernel dynamically sizes itself and loads device drivers and file systems, it takes memory from the free list. Kernel memory is normally locked and cannot be paged out in a memory shortage, but the kernel does free memory if there is a lot of demand for it. Unused device drivers will be unloaded, and unused slabs of kernel memory data structures will be freed. I notice that sometimes during heavy paging there is a pop from the speaker on a desktop system. This occurs when the audio device driver is unloaded. If you run a program that keeps the device open, that driver cannot be unloaded.

One problem that can occur is reported as a kernel memory allocation error by sar -k and by the kernel rule in the SE toolkit. There is a limit on the size of kernel memory, but at 3.75 Gbytes on UltraSPARC systems, this limit is unlikely to be reached. The most common cause of this problem is that the kernel tried to get some memory while the free list was completely empty. Since the kernel cannot always wait, this attempt may cause operations to fail rather than be delayed. The streams subsystem cannot wait, and I have seen remote logins fail due to this issue when a large number of users try to log in at the same time. In Solaris 2.5.1 and the current kernel jumbo patch for Solaris 2.3, 2.4, and 2.5, changes were made to make the free list larger on big systems and to try and prevent it from ever being completely empty.

Filesystem Cache

The second consumer of memory is files. For a file to be read or written, memory must be obtained to hold the data for the I/O. While the I/O is happening, the pages are temporarily locked in memory. After the I/O is complete, the pages are unlocked but are retained in case the contents of the file are needed again. The filesystem cache is often one of the biggest users of memory in a system. Note that in all versions of Solaris 1 and Solaris 2, the data in the filesystem cache is separate from kernel memory. It does not appear in any address space, so there is no limit on its size. If you want to cache many gigabytes of filesystem data in memory, you just need to buy a 30-Gbyte E6000 or a 64-Gbyte E10000. We cached an entire 9-Gbyte database on a 16-Gbyte E6000 once, just to show it could be done, and read-intensive queries ran 350 times faster than reading from disk. SPARC-based systems use either a 36-bit (SuperSPARC) or 41-bit (UltraSPARC) physical address, unlike many other systems that limit both virtual and physical to 32 bits and so stop at 4 Gbytes. A 64-bit virtual address space will be supported on UltraSPARC systems via the next major Solaris release during 1998.

If you delete or truncate a file, the cached data becomes invalid, so it is returned to the head of the free list. If the kernel decides to stop caching the inode for a file, any pages of cached data attached to that inode are also freed.

For most files, the only way they become uncached is by having inactive pages stolen by the pageout scanner. The scanner runs only when the free list shrinks below a threshold (lotsfree in pages, usually set at a few Mbytes), so eventually most of the memory in the system ends up in the filesystem cache. When there is a new, large demand for memory, the scanner will steal most of the required pages from the inactive files in the filesystem cache. These files include executable program code files as well as data files. Anything that is currently inactive and not locked in memory will be taken.

Files are also mapped directly by processes using mmap. This call maps in the code and shared libraries for a process. Pages may have multiple references from several processes while also being resident in the filesystem cache. A recent optimization is that pages with eight or more references are skipped by the scanner, even if they are inactive. This feature helps shared libraries and multiprocess server code stay resident in memory.
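
For readers who have not used mmap directly, here is a minimal sketch of a read-only file mapping, using Python's standard mmap module rather than the C interface. The helper names and the temporary file are illustrative only. The point is that the bytes are reached through the mapping rather than through read calls, so on a system like the one described the pages that back them are the same ones held in the filesystem cache.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Map a whole (non-empty) file read-only and return a slice of it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            # Slicing touches the mapped pages; no explicit read() is made.
            return mapping[offset:offset + length]


def demo():
    """Write a small temporary file, then read part of it via a mapping."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello, pages")
        path = tmp.name
    try:
        return read_via_mmap(path, 7, 5)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```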

Process Private Memory

Processes also have private memory to hold their stack space, modified data areas, and heap. Stack and heap memory is always initialized to zero before it is given to the process. The stack grows dynamically as required; the heap is grown by the brk system call, usually when the malloc library routine needs some more space. Data and code areas are initialized by mapping from a file, but as soon as they are written to, a private copy of the page is made. The only way to see how much resident memory is used by private or shared pages is to use the /usr/proc/bin/pmap -x command in Solaris 2.6, or memtool, described in “Memory Usage and Sizing Tools” on page 319. All that is normally reported is the total number of resident mapped pages, as the RSS field in some forms of ps output (e.g., /usr/ucb/ps uax). The SIZE or SZ field indicates the total size of the address space, which includes memory-mapped devices like framebuffers as well as pages that are not resident. SIZE really indicates how much virtual memory the process needs and is more closely related to swap space requirements.

When a process first starts up, it consumes memory very rapidly until it reaches its working set. If it is a user-driven tool, it may also need to grab a lot more memory to perform operations like opening a new window or processing an input file. In some cases, the response time to the user request is affected by how long it takes to obtain a large amount of memory. If the process needs more than is currently in the free list, it goes to sleep until the pageout scanner has obtained more memory for it. In many cases, additional memory is requested one page at a time, and on a uniprocessor the process will eventually be interrupted by the scanner as it wakes up to replenish the free list.

Free List Performance Problems and Deficit

The memory reclaim algorithm can cause problematic behavior. After a reboot, the free list is very large, and memory-intensive programs have a good response time. After a while, the free list is consumed by the file cache, and the page scanner cuts in. At that point, the response time may worsen because large amounts of memory are not immediately available. System administrators may watch this happening with vmstat and see that free memory decreases, and see that when free memory gets low, paging starts and performance worsens. An initial reaction is to add some more RAM to the system, but this response does not usually solve the problem. It may postpone the problem, but it may also make paging more intense because some paging parameters are scaled up as you add RAM. The kernel tries to counteract this effect by calculating running averages of the memory demand over 5-second and 30-second periods. If the average demand is high, the kernel expects that more memory will be needed in the future, so it sets up a deficit, which makes the target size of the free list up to twice as big as normal. This is the de column reported by vmstat. The deficit decays over a few seconds back to zero, so you often see a large deficit suddenly appear, then decay away again. With the latest kernel code, the target size of the free list (set by lotsfree) increases on big memory systems, and since the deficit is limited to the same value, you should expect to see larger peak values of de on large-memory systems.

The real problem is that the free list is too small, and it is being replenished too aggressively, too late. The simplest fix is to increase lotsfree, but remember that the free list is unused memory. If you make it too big, you are wasting RAM. If you think that the scanner is being too aggressive, you can also try reducing fastscan, which is the maximum page scanner rate in pages/s. By default, fastscan is limited to a maximum of 64 Mbytes/s: either 16,384 or 8,192 pages/s, depending upon the page size. By increasing lotsfree, you also increase the maximum value of the deficit. You also have to let the system stabilize for a while after the first time that the page scanner cuts in. It needs to go right round the whole of memory once or twice before it settles down. (vmstat -s tells you the number of revolutions it has done.) It should then run on a “little and often” basis. As long as you always have enough free RAM to handle the short-term demand and don’t have to scan hard all the time, performance should be good.
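
The two fastscan figures follow directly from the 64-Mbyte/s ceiling and the page size. Here is a small worked sketch in Python; the function names are illustrative only. The revolution estimate is simple arithmetic that assumes the scanner runs flat out at the given rate, which is not a claim about how the kernel actually paces itself.

```python
FASTSCAN_LIMIT_BYTES = 64 * 1024 * 1024  # 64 Mbytes/s ceiling from the text


def fastscan_limit_pages(page_size):
    """Maximum default scan rate in pages/s for a given page size."""
    return FASTSCAN_LIMIT_BYTES // page_size


def seconds_per_revolution(ram_bytes, page_size, scan_rate_pages):
    """Time to scan all of RAM once at a constant scan rate."""
    return (ram_bytes / page_size) / scan_rate_pages


if __name__ == "__main__":
    for size in (4096, 8192):
        print(size, fastscan_limit_pages(size))
    # 128 Mbytes of RAM with 4-Kbyte pages, scanned at the ceiling.
    print(seconds_per_revolution(128 * 1024 * 1024, 4096, 16384))
```

At the ceiling, a 128-Mbyte machine with 4-Kbyte pages is covered in about two seconds, which gives a feel for how short the gap between the scanner's two visits to a page can become.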

System V Shared Memory

There are a few applications that make trivial use of System V shared memory, but the big important applications are the database servers. Databases benefit from very large shared caches of data in some cases, and use System V shared memory to allocate as much as 3.75 Gbytes of RAM. By default, applications such as Oracle, Informix, and Sybase use a special flag to specify that they want intimate shared memory (ISM). In this case, two special changes are made. First, all the memory is locked and cannot be paged out. Second, the memory management data structures that are normally created on a per-process basis are created once, then shared by every process. In Solaris 2.6, a further optimization takes place as the kernel tries to find 4-Mbyte contiguous blocks of physical memory that can be used as large pages to map the shared memory. This process greatly reduces memory management unit overhead. See “Shared Memory Tunables” on page 344.

Filesystem Flush

Unlike SunOS 4 where the update process does a full sync of memory to disk every 30 seconds, Solaris 2 uses the fsflush daemon to spread out the sync workload. autoup is set to 30 seconds by default, and this value is the maximum age of any memory-resident pages that have been modified. fsflush wakes up every 5 seconds (set by tune_t_fsflushr) and checks a portion of memory on each invocation (5/30 = one-sixth of total RAM by default). The pages are queued on the same list that the pageout daemon uses and are formed into clustered sequential writes. During each invocation, every tune_t_fsflushr seconds, fsflush also flushes modified entries from inode caches to disk. This process occurs for all relevant filesystem types and can be disabled by setting doiflush to zero.

On systems with very large amounts of RAM, fsflush has a lot of work to do on each invocation. Since Solaris 2.4, a special algorithm monitors and limits CPU usage of fsflush. The workload can be reduced if necessary. fsflush should still always wake up every few seconds, but autoup can be increased from 30 seconds to a few hundred seconds if required. In many cases, files that are being written are closed before fsflush gets around to them. For NFS servers, all writes are synchronous, so fsflush is hardly needed at all. For database servers using raw disk partitions, fsflush will have little useful effect, but its fixed overhead increases as memory size increases. The biggest use of fsflush comes on time-shared systems and systems that do a lot of local filesystem I/O without using direct I/O or synchronous writes. Note the time and the CPU usage of fsflush, then watch it later and see if its CPU usage is more than 5%. If it is, increase autoup, as shown in Figure 13-7, or disable page flushing completely by setting dopageflush to zero.

Figure 13-7. Reducing fsflush CPU Usage by Increasing autoup in /etc/system
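The one-sixth figure follows directly from the two tunables. The Python sketch below only restates that arithmetic; the function name is hypothetical, and it assumes the share of memory examined per wakeup is exactly tune_t_fsflushr divided by autoup, as the text describes.

```python
def fsflush_work_per_wakeup(total_pages, autoup=30, tune_t_fsflushr=5):
    """Return (fraction, pages) of memory fsflush examines on each wakeup.

    Assumes the share is tune_t_fsflushr / autoup; integer arithmetic is
    used for the page count so the result is exact.
    """
    fraction = tune_t_fsflushr / autoup
    pages = total_pages * tune_t_fsflushr // autoup
    return fraction, pages


# Defaults: 5-second wakeups against a 30-second maximum page age.
fraction, pages = fsflush_work_per_wakeup(total_pages=6000)
print(f"default: {fraction:.3f} of memory, {pages} pages per wakeup")

# Raising autoup to 240 shrinks the work done on each wakeup.
fraction, pages = fsflush_work_per_wakeup(total_pages=6000, autoup=240)
print(f"autoup=240: {fraction:.3f} of memory, {pages} pages per wakeup")
```

Raising autoup lowers the per-wakeup share, which is why it is the usual lever for reducing fsflush overhead.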
set autoup=240

Local files are flushed when they are closed, but long-lived log files may be truncated in a crash if page flushing is completely disabled. Figure 13-8 shows commands for evaluating fsflush.

Figure 13-8. Measuring fsflush CPU Usage
# prtconf | head -2 
System Configuration: Sun Microsystems sun4u
Memory size: 21248 Megabytes
# /usr/ucb/ps 3
PID TT S TIME COMMAND
3 ? S 23:00 fsflush
# uptime
12:42pm up 4:44, 1 user, load average: 0.03, 0.04, 0.04
# /opt/RICHPse/bin/se pwatch.se
fsflush has used 1382.1 seconds of CPU time since boot
fsflush used 8.3 %CPU, 8.3 %CPU average so far
fsflush used 8.3 %CPU, 8.3 %CPU average so far

Since boot on this 20-Gbyte E10000, fsflush has used 1382 seconds of CPU time, which is 8.3% of a CPU. This usage is not unreasonable, but the system is idle and fsflush CPU usage will increase when it gets busy. Systems with less memory will show proportionally less CPU usage.
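The same check can be made by hand from the accumulated CPU time and the uptime. The following Python sketch is illustrative only: the function name is hypothetical, and because uptime reports only hours and minutes, the result approximates the figure that the SE script derives from more precise data.

```python
def cpu_percent(cpu_seconds, uptime_hours, uptime_minutes):
    """Average share of one CPU used since boot, as a percentage."""
    uptime_seconds = uptime_hours * 3600 + uptime_minutes * 60
    return 100.0 * cpu_seconds / uptime_seconds


# Values taken from Figure 13-8: 1382.1 CPU seconds, up 4:44.
usage = cpu_percent(1382.1, 4, 44)
print(f"fsflush average since boot: {usage:.1f}% of one CPU")

# The 5% guideline from the text.
if usage > 5.0:
    print("above 5%: consider increasing autoup")
```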

Scan Rate Threshold Indicating a Memory Shortage

By default, fsflush makes sure that any modified page is flushed to its backing store after 30 seconds. The pageout scanner is also scanning memory looking for inactive pages at a variable rate. If that rate corresponds to more than 30 seconds, then fsflush will have got there first and flushed the page. If the page is inactive, it can be freed immediately. If the scan rate corresponds to less than 30 seconds, then many inactive pages will still be modified and unflushed and will need a page-out operation to flush them before they can be put on the free list. The system will run better if the page-out scan rate is slower than the fsflush scan rate because freed pages will be available immediately. My SE toolkit rule for memory shortage, described in “RAM Rule” on page 456, divides the average scan rate into the value of handspreadpages to obtain the idle page residence time. The thresholds used are 40 seconds as an amber warning level, and 20 seconds as a red problem level. The rate at which fsflush works, given by autoup and defaulting to 30 seconds, is the unflushed page delay time. If there is a memory shortage on a system with a small amount of memory, it may be a good idea to try reducing autoup to 10 or 20 seconds.

The value of handspreadpages is clamped at 64 Mbytes, or a quarter of all memory. For UltraSPARC systems with 256 Mbytes or more of memory, the value will be fixed at 8192. This value gives a simpler threshold to watch for: a scan rate of above 300 pages per second indicates a memory shortage.
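The residence time calculation is simple enough to sketch. The Python below is not the SE toolkit rule itself, only an illustration of the arithmetic it is described as using; the function names, the "ok" state, and the choice of strict comparisons at the thresholds are assumptions.

```python
def idle_page_residence(handspreadpages, scan_rate):
    """Seconds an idle page stays in memory before the scanner reaches it."""
    if scan_rate <= 0:
        return float("inf")  # scanner idle: pages are never reclaimed
    return handspreadpages / scan_rate


def ram_state(handspreadpages, scan_rate, amber=40.0, red=20.0):
    """Classify a scan rate using the 40-second and 20-second thresholds."""
    residence = idle_page_residence(handspreadpages, scan_rate)
    if residence < red:
        return "red"
    if residence < amber:
        return "amber"
    return "ok"


# With handspreadpages fixed at 8192, about 300 pages/s gives roughly
# 27 seconds of residence, which is below the 30-second fsflush interval.
for rate in (0, 100, 300, 500):
    seconds = idle_page_residence(8192, rate)
    print(rate, round(seconds, 1), ram_state(8192, rate))
```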


The 30-second memory residence time threshold is not just based on the interaction with fsflush. It is a generally accepted threshold for sizing disk caches on mainframe systems, for example. If autoup has been increased on a system because fsflush is not necessary, then the scan rate residence time should still be 30 seconds. In the case of NFS servers and database servers, writes are synchronous, so most filesystem pages in memory should be unmodified and can be freed immediately.

Kernel Values, Tunables, and Defaults

In this section I describe the most important kernel variables that can be used to tune the virtual memory system. They are normally set in the /etc/system file. Figure 13-9 illustrates the relationship between the parameters that control page scanning.

Figure 13-9. Parameters to Control Page Scanning Rate and Onset of Swapping


fastscan

The fastest scan rate, corresponding to an empty free list. It can be reduced if lotsfree has been increased, but it must be more than slowscan. fastscan can be changed on line. fastscan is set to (physmem/2) in Solaris 2.1 through 2.3 and (physmem/4) with a limit of 64 Mbytes (8,192 or 16,384 pages) since Solaris 2.4.
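As a rough illustration of the Solaris 2.4 and later default, the following Python sketch applies the physmem/4 rule and the 64-Mbyte limit. The function name is hypothetical, and the integer rounding is an assumption rather than something taken from the kernel source.

```python
LIMIT_BYTES = 64 * 1024 * 1024  # the 64-Mbyte ceiling described in the text


def default_fastscan(physmem_pages, page_size):
    """Default fastscan in pages/s: physmem/4, capped at 64 Mbytes of pages."""
    limit_pages = LIMIT_BYTES // page_size
    return min(physmem_pages // 4, limit_pages)


# 128 Mbytes of 8-Kbyte pages: still below the cap.
print(default_fastscan(16384, 8192))    # 4096
# 1 Gbyte of 8-Kbyte pages: capped at 8,192 pages/s.
print(default_fastscan(131072, 8192))   # 8192
# 1 Gbyte of 4-Kbyte pages: capped at 16,384 pages/s.
print(default_fastscan(262144, 4096))   # 16384
```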

slowscan

The initial scan rate at the point where scanning first starts. It defaults to 100, but some recent tests seem to indicate that refixing it at around 500 is beneficial. The idea of a minimum is that it is only worth waking up the scanner if it has a reasonable amount of work to do. Higher values of slowscan cause the pager to run less often and do more work each time. slowscan can be changed on line.

physmem

Set to the number of pages of usable physical memory. The maxusers calculation is based upon physmem, as described in “Parameters Derived from maxusers” on page 359. If you are investigating system performance and want to run tests with reduced memory on a system, you can set physmem in /etc/system and reboot to prevent a machine from using all its RAM. The unused memory still uses some kernel resources, and the page scanner still scans it, so if the reduced memory system does a lot of paging, the effect will not be exactly the same as physically removing the RAM.

lotsfree

The target size of the free list in pages. Set to physmem/16 in Solaris 2.3, it was too large and memory was being wasted. It was then fixed at 128 pages on desktop machines, and 256 pages on servers (sun4d) in Solaris 2.4 and 2.5. This allocation turned out to be too small. For Solaris 2.5.1, it is scaled again, to physmem/64 (with a minimum of 512 Kbytes), and there are many other changes to make the algorithm more robust. This code has been backported to all the current kernel jumbo patches for previous releases. lotsfree can be changed on line with immediate effect if you know what you are doing and are careful. During the 100 Hz clock routine, the kernel tests four times a second to see if freemem is less than lotsfree. If so, a wakeup is sent to the pageout daemon.

desfree

The desperation threshold. It is set to half the value of lotsfree. desfree is used for the following purposes:

  • If a 30-second average of free memory is less than desfree, then inactive processes will be swapped out.

  • At the point where pages are taken from the free list, if freemem is less than desfree, then an immediate wakeup call is sent to the pageout daemon, rather than waiting for pageout to be awakened by the clock interrupt.

minfree

The minimum threshold. It is set to half of the value of desfree. minfree is used for the following purposes:

  • If the short-term (5-second) average free memory is less than minfree, then processes will be swapped out.

  • When exec is running on a small program (under 280 Kbytes), the entire program is loaded in at one time rather than being paged in piecemeal, as long as doing so would not reduce freemem below minfree.

throttlefree

Suspends processes that are trying to consume memory too quickly. It is set the same as minfree by default and must be less than desfree. throttlefree penalizes processes that are consuming additional memory and favors existing processes that are having memory stolen from them. If freemem reaches throttlefree, any process that tries to get another page is suspended until memory goes back above desfree.
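The default relationships among lotsfree, desfree, minfree, and throttlefree can be summarized in a few lines. This Python sketch follows the Solaris 2.5.1 scaling described above; the function name and the use of integer division are assumptions, and a system with values set explicitly in /etc/system will not match it.

```python
def free_list_thresholds(physmem_pages, page_size):
    """Default free-list thresholds, in pages, per the 2.5.1 scaling rule."""
    minimum_lotsfree = (512 * 1024) // page_size  # the 512-Kbyte floor
    lotsfree = max(physmem_pages // 64, minimum_lotsfree)
    desfree = lotsfree // 2        # half of lotsfree
    minfree = desfree // 2         # half of desfree
    throttlefree = minfree         # same as minfree by default
    return {
        "lotsfree": lotsfree,
        "desfree": desfree,
        "minfree": minfree,
        "throttlefree": throttlefree,
    }


# 256 Mbytes of 8-Kbyte pages.
print(free_list_thresholds(32768, 8192))
# A small 16-Mbyte machine hits the 512-Kbyte floor instead.
print(free_list_thresholds(2048, 8192))
```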

handspreadpages

Set to (physmem/4) but is increased during initialization to be at least as big as fastscan, which makes it (physmem/2) for Solaris 2.1 through 2.3. It is usually set to the same value as fastscan. In Solaris 2, it is limited to 64 Mbytes, like fastscan. Note that the definition of this variable changed from handspread, measured in bytes in SunOS 4, to handspreadpages, measured in pages in Solaris 2.

max_page_get

Set to half the number of pages in the system. It limits the maximum number of pages that can be allocated in a single operation. In some circumstances, a machine may be sized to run a single, very large program that has a data area or single malloc space of more than half the total RAM. It will be necessary to increase max_page_get in that circumstance. If max_page_get is increased too far and reaches total_pages (a little less than physmem), then deadlock can occur and the system will hang trying to allocate more pages than exist.

maxpgio

The maximum number of page-out I/O operations per second that the system will schedule. The default is 40 in older releases, which is set to avoid saturating random access to a single 3600 rpm (60 rps) disk at two-thirds of the rotation rate. maxpgio is set to 60 for sun4d kernels only; it should be increased if more or faster disks are being used for the swap space. Many systems now have 7200 rpm (120 rps) disks, so maxpgio should be set to about 100 times the number of swap disks in use. See Table 8-4 on page 204 for disk specifications. The value is divided by four during system initialization since the pageout daemon runs four times per second; the resulting value is the limit on the number of page-outs that the pageout daemon will add to the page-out queue in each invocation. Note that in addition to the clock-based invocations, an additional invocation will occur whenever more memory is allocated and freemem is less than desfree, so more than maxpgio pages can be queued per second when a lot of memory is allocated in a short time period. Changes to maxpgio only take effect after a reboot, so it cannot be tweaked on a running system.
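To make the division by four concrete, the Python sketch below derives the per-invocation page-out quota from maxpgio and reproduces the two-thirds-of-rotation-rate reasoning behind the old default of 40. Both function names are hypothetical, and truncating integer division is an assumption.

```python
def pageouts_per_invocation(maxpgio, invocations_per_second=4):
    """Page-outs the pageout daemon may queue on each clock-based invocation."""
    return maxpgio // invocations_per_second


def two_thirds_of_rotation(rpm):
    """Two-thirds of a disk's rotation rate, in operations per second."""
    revolutions_per_second = rpm // 60
    return revolutions_per_second * 2 // 3


# A 3600 rpm disk turns 60 times a second; two-thirds of that is 40.
print(two_thirds_of_rotation(3600))

# Per-invocation quotas for the old default, the sun4d default, and a
# system following the "about 100 per swap disk" advice with two disks.
for maxpgio in (40, 60, 200):
    print(maxpgio, pageouts_per_invocation(maxpgio))
```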

tune_t_fsflushr and autoup

Described in “Filesystem Flush” on page 333.

tune_t_gpgslo

A feature derived from Unix System V.3. Since Solaris 2.4, it is no longer used.

Swap Space

For all practical purposes, swapping can be ignored in Solaris 2. The time-based soft swapouts that occur in SunOS 4 are no longer implemented. vmstat -s will report total numbers of swap-ins and swap-outs, which are almost always zero. Prolonged memory shortages can trigger swap-outs of inactive processes. Swapping out idle processes helps performance of machines with less than 32 Mbytes of RAM. The number of idle swapped-out processes is reported as the swap queue length by vmstat. This measurement is not explained properly in the manual page since that measure used to be the number of active swapped-out processes waiting to be swapped back in. As soon as a swapped-out process wakes up again, it will swap its basic data structures back into the kernel and page in its code and data as they are accessed. This activity requires so little memory that it can always happen immediately.

Figure 13-10. Example vmstat Output Highlighting Swap Queue
% vmstat 5 
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s1 s5 in sy cs us sy id
...
0 0 4 314064 5728 0 7 2 1 1 0 0 0 177 91 22 132 514 94 1 0 98


If you come across a system with a non-zero swap queue reported by vmstat, it is a sign that at some time in the past, free memory stayed low for long enough to trigger swapping out of idle processes. This is the only useful conclusion you can draw from this measure.

Swap Space Operations

Swap space is really a misnomer for what is actually the paging space. Almost all the accesses are page related rather than being whole-process swapping.

Swap space is allocated from spare RAM and from swap disk. The measures provided are based on two sets of underlying numbers. One set relates to physical swap disk, the other set relates to RAM used as swap space by pages in memory.

Swap space is used in two stages. When memory is requested (for example, via a malloc call) swap is reserved and a mapping is made against the /dev/zero device. Reservations are made against available disk-based swap to start with. When that is all gone, RAM is reserved instead. When these pages are first accessed, physical pages are obtained from the free list and filled with zeros, and pages of swap become allocated rather than reserved. In effect, reservations are initially taken out of disk-based swap, but allocations are initially taken out of RAM-based swap. When a page of anonymous RAM is stolen by the page scanner, the data is written out to the swap space, i.e., the swap allocation is moved from memory to disk, and the memory is freed.

Memory space that is mapped but never used stays in the reserved state, and the reservation consumes swap space. This behavior is common for large database systems and is the reason why large amounts of swap disk must be configured to run applications like Oracle and SAP R/3, even though they are unlikely to allocate all the reserved space.

The first swap partition allocated is also used as the system dump space to store a kernel crash dump into. It is a good idea to have plenty of disk space set aside in /var/crash and to enable savecore by uncommenting the commands in /etc/rc2.d/S20sysetup. If you forget and think that there may have been an unsaved crash dump, you can try running savecore long after the system has rebooted. The crash dump is stored at the very end of the swap partition, and the savecore command can tell if it has been overwritten yet.
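The order in which reservations are satisfied can be modeled in a few lines. The Python sketch below is a toy model rather than kernel logic: the function name is hypothetical, it tracks only the disk-first, RAM-second reservation order described above, and it ignores allocation and page-out entirely.

```python
def reserve(pages, disk_unreserved, ram_unreserved):
    """Split a swap reservation between disk and RAM, taking disk first.

    Returns (from_disk, from_ram). Raises MemoryError when the request
    cannot be met, mirroring a reservation failure.
    """
    from_disk = min(pages, disk_unreserved)
    from_ram = pages - from_disk
    if from_ram > ram_unreserved:
        raise MemoryError("swap reservation failed")
    return from_disk, from_ram


# Plenty of disk-based swap left: nothing is reserved from RAM.
print(reserve(10, disk_unreserved=60, ram_unreserved=50))
# Disk-based swap runs out part way: the remainder comes from RAM.
print(reserve(100, disk_unreserved=60, ram_unreserved=50))
```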

Swap Space Calculations

Please refer to Jim Mauro’s Inside Solaris columns in SunWorld Online December 1997 and January 1998 for a detailed explanation of how swap space works. See http://www.sun.com/sunworldonline/swol-01-1998/swol-01-insidesolaris.html.

Disk space used for swap is listed by the swap -l command. All swap space segments must be 2 Gbytes or less in size. Any extra space is ignored. The anoninfo structure in the kernel keeps track of anonymous memory. In Solaris 2.6 this structure changed its name to k_anoninfo, but the three values described below are the same. This illustrates why it is best to rely on the more stable kstat interface rather than the raw kernel data. In this case, the data provided is so confusing that I felt I had to see how the kstat data is derived.

anoninfo.ani_max is the total amount of disk-based swap space.

anoninfo.ani_resv is the amount reserved thus far from both disk and RAM.

anoninfo.ani_free is the amount of unallocated physical space plus the amount of reserved unallocated RAM.

If ani_resv is greater than ani_max, then we have reserved all the disk and reserved some RAM-based swap. Otherwise, the amount of disk-resident swap space available to be reserved is ani_max minus ani_resv.

swapfs_minfree is set to physmem/8 (with a minimum of 3.5 Mbytes) and acts as a limit on the amount of memory used to hold anonymous data.

availrmem is the amount of resident, unswappable memory in the system. It varies and can be read from the system_pages kstat shown in Figure 13-11.

The amount of swap space that can be reserved from memory is availrmem minus swapfs_minfree.

The total amount available for reservation is thus MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree). A reservation failure will prevent a process from starting or growing. Allocations are not really interesting.

The counters provided by the kernel to commands such as vmstat and sar are part of the vminfo kstat structure. These counters accumulate once per second, so average swap usage over a measurement interval can be determined. The swap -s command reads the kernel directly to obtain a snapshot of the current anoninfo values, so the numbers will never match exactly. Also, the simple act of running a program changes the values, so you cannot get an exact match. The vminfo calculations are:

swap_resv += ani_resv

swap_alloc += MAX(ani_resv, ani_max) - ani_free

swap_avail += MAX(ani_max - ani_resv, 0) + (availrmem - swapfs_minfree)

swap_free += ani_free + (availrmem - swapfs_minfree)

Figure 13-11. Example system_pages kstat Output from netstat -k
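These four expressions are easy to check by hand. The Python sketch below computes the value added to each counter for a single sample; the function name is hypothetical, and the input values are the ones printed in Figure 13-12, so the results can be compared with that figure.

```python
def swap_sample(ani_max, ani_resv, ani_free, availrmem, swapfs_minfree):
    """Per-sample increments for the vminfo swap counters, in pages."""
    ramres = availrmem - swapfs_minfree  # RAM that can back reservations
    return {
        "swap_resv": ani_resv,
        "swap_alloc": max(ani_resv, ani_max) - ani_free,
        "swap_avail": max(ani_max - ani_resv, 0) + ramres,
        "swap_free": ani_free + ramres,
    }


# Input values from the first line of Figure 13-12.
sample = swap_sample(
    ani_max=54814,
    ani_resv=19429,
    ani_free=37981,
    availrmem=13859,
    swapfs_minfree=1972,
)
for name, pages in sample.items():
    print(name, pages)
```

The results (16833 allocated, 47272 available, 49868 free) agree with the second line of Figure 13-12.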
system_pages: 
physmem 15778 nalloc 7745990 nfree 5600412 nalloc_calls 2962 nfree_calls 2047
kernelbase 268461504 econtig 279511040 freemem 4608 availrmem 13849 lotsfree 256
desfree 100 minfree 61 fastscan 7884 slowscan 500 nscan 0 desscan 125
pp_kernel 1920 pagesfree 4608 pageslocked 1929 pagesio 0 pagestotal 15769


To figure out how the numbers really do add up, I wrote a short program in SE and compared it to the example data shown in Figure 13-12. To get the numbers to match, I needed some odd combinations for sar and swap -s. In summary, the only useful measure is swap_available, as printed by swap -s, vmstat, and sar -r (although sar labels it freeswap and before Solaris 2.5 sar actually displayed swap_free rather than swap_avail). The other measures are mislabeled and confusing. The code for the SE program in Figure 13-13 shows how the data is calculated and suggests a more useful display that is also simpler to calculate.

Figure 13-12. Example Swap Data Calculations
# se swap.se 
ani_max 54814 ani_resv 19429 ani_free 37981 availrmem 13859 swapfs_minfree 1972
ramres 11887 swap_resv 19429 swap_alloc 16833 swap_avail 47272 swap_free 49868

Misleading data printed by swap -s
134664 K allocated + 20768 K reserved = 155432 K used, 378176 K available
Corrected labels:
134664 K allocated + 20768 K unallocated = 155432 K reserved, 378176 K available

Mislabelled sar -r 1
freeswap (really swap available) 756352 blocks

Useful swap data: Total swap 520 M
available 369 M reserved 151 M Total disk 428 M Total RAM 92 M
# swap -s
total: 134056k bytes allocated + 20800k reserved = 154856k used, 378752k available
# sar -r 1
18:40:51 freemem freeswap
18:40:52 4152 756912


The only thing you need to know about SE to read this code is that reading kvm$name causes the current value of the kernel variable name to be read.

Figure 13-13. SE Code to Read Swap Measures and Display Correctly Labeled Data
/* extract all the swap data and generate the numbers */
/* must be run as root to read kvm variables */

struct anon {
    int ani_max;
    int ani_free;
    int ani_resv;
};

int max(int a, int b) {
    if (a > b) {
        return a;
    } else {
        return b;
    }
}

main() {
#if MINOR_RELEASE < 60
    anon kvm$anoninfo;
#else
    anon kvm$k_anoninfo;
#endif
    anon tmpa;
    int kvm$availrmem;
    int availrmem;
    int kvm$swapfs_minfree;
    int swapfs_minfree;
    int ramres;
    int swap_alloc;
    int swap_avail;
    int swap_free;
    int kvm$pagesize;
    int ptok = kvm$pagesize/1024;
    int res_but_not_alloc;
#if MINOR_RELEASE < 60
    tmpa = kvm$anoninfo;
#else
    tmpa = kvm$k_anoninfo;
#endif
    availrmem = kvm$availrmem;
    swapfs_minfree = kvm$swapfs_minfree;

    ramres = availrmem - swapfs_minfree;
    swap_alloc = max(tmpa.ani_resv, tmpa.ani_max) - tmpa.ani_free;
    swap_avail = max(tmpa.ani_max - tmpa.ani_resv, 0) + ramres;
    swap_free = tmpa.ani_free + ramres;
    res_but_not_alloc = tmpa.ani_resv - swap_alloc;

    printf("ani_max %d ani_resv %d ani_free %d availrmem %d swapfs_minfree %d\n",
        tmpa.ani_max, tmpa.ani_resv, tmpa.ani_free,
        availrmem, swapfs_minfree);
    printf("ramres %d swap_resv %d swap_alloc %d swap_avail %d swap_free %d\n",
        ramres, tmpa.ani_resv, swap_alloc, swap_avail, swap_free);

    printf("\nMisleading data printed by swap -s\n");
    printf("%d K allocated + %d K reserved = %d K used, %d K available\n",
        swap_alloc * ptok, res_but_not_alloc * ptok,
        tmpa.ani_resv * ptok, swap_avail * ptok);
    printf("Corrected labels:\n");
    printf("%d K allocated + %d K unallocated = %d K reserved, %d K available\n",
        swap_alloc * ptok, res_but_not_alloc * ptok,
        tmpa.ani_resv * ptok, swap_avail * ptok);

    printf("\nMislabelled sar -r 1\n");
    printf("freeswap (really swap available) %d blocks\n",
        swap_avail * ptok * 2);

    printf("\nUseful swap data: Total swap %d M\n",
        swap_avail * ptok / 1024 + tmpa.ani_resv * ptok / 1024);
    printf("available %d M reserved %d M Total disk %d M Total RAM %d M\n",
        swap_avail * ptok / 1024, tmpa.ani_resv * ptok / 1024,
        tmpa.ani_max * ptok /1024, ramres * ptok / 1024);
}


Over the years, many people have struggled to understand the Solaris 2 swap system. When people try to add up the numbers from the commands, they get even more confused. It’s not their fault. It really is confusing, and the numbers don’t add up unless you know how they were calculated in the first place!

System V Shared Memory, Semaphores, and Message Queues

I’m listing these interprocess communication tunables all together because they are often set up together in /etc/system. This set of descriptions is based on a listing put together by Jim Mauro. The default and maximum values are tabulated in Table A-7 on page 559.

Shared Memory Tunables

As described in “System V Shared Memory” on page 333, System V shared memory is mostly used by database applications.

shmmax

The maximum size of a shared memory segment. The largest value a program can use in a call to shmget(2). Setting this tunable to a high value doesn’t impact anything because kernel resources are not preallocated with this value.

shmmin

The smallest possible size of a shared memory segment. It is the smallest value that a program can use in a call to shmget(2). The default is 1 byte; there should never be any reason to change shmmin.

shmmni

The maximum number of shared memory identifiers that can exist in the system at any point in time. Every shared segment has an identifier associated with it, and this is what shmget(2) returns. The number of shared memory identifiers you need depends on the application. How many shared segments do we need? Setting shmmni too high has some fallout since the system uses this value during initialization to allocate kernel resources. Specifically, a shmid_ds structure is created for each possible shmmni: thus, the kernel memory allocated equals shmmni times sizeof(struct shmid_ds). A shmid_ds structure is 112 bytes; you can do the arithmetic and determine the initial overhead in making this value large.
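The arithmetic is a single multiplication. The Python sketch below uses the 112-byte structure size quoted above; the function name and the sample shmmni values are illustrative only.

```python
SHMID_DS_BYTES = 112  # size of struct shmid_ds quoted in the text


def shmid_overhead_bytes(shmmni):
    """Kernel memory set aside at initialization for shmid_ds structures."""
    return shmmni * SHMID_DS_BYTES


for shmmni in (100, 1000, 10000):
    print(shmmni, shmid_overhead_bytes(shmmni), "bytes")
```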

shmseg

The maximum number of segments per process. We don’t allocate resources based on this value; we simply keep a per-process count of the number of shared segments the process is attached to, and we check that value to ensure it is less than shmseg before we allow another attach to complete. The maximum value should be the current value of shmmni, since setting it greater than shmmni is pointless, and it should always be less than 65535. It is application dependent. Ask yourself, “How many shared memory segments do the processes running my application need to be attached to at any point in time?”

Semaphores

Semaphore tunables are not as easy to understand as those of shared segments because of the complex features of semaphores, such as the ability to use a single semaphore, or several semaphores in a set.

semmap

The number of entries in the semaphore map. The memory space given to the creation of semaphores is taken from the semaphore map, which is initialized with a fixed number of map entries based on the value of semmap. The implementation of allocation maps is generic within SVR4, supported with a standard set of kernel routines (rmalloc(), rmfree(), etc.). The use of allocation maps by the semaphore subsystem is just one example of their use. They prevent the kernel from having to deal with mapping additional kernel memory as semaphore use grows. By initializing and using allocation maps, kernel memory is allocated up front, and map entries are allocated and freed dynamically from the semmap allocation maps. semmap should never be larger than semmni. If the number of semaphores per semaphore set used by the application is known and we call that number “n,” then you can use:

semmap = ((semmni + n - 1) / n) + 1

If you make semmap too small for the application, you’ll get: “WARNING: rmfree map overflow” messages on the console. Set it higher and reboot.
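As a worked example of the semmap formula, the Python sketch below evaluates it for a couple of configurations. The function name is hypothetical, and treating the division as integer division is an assumption, since the text does not say how fractions are handled.

```python
def suggested_semmap(semmni, n):
    """Evaluate semmap = ((semmni + n - 1) / n) + 1 with integer division.

    semmni is the number of semaphore sets; n is the number of
    semaphores per set used by the application.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of semaphores per set")
    return (semmni + n - 1) // n + 1


# 100 semaphore sets of 10 semaphores each.
print(suggested_semmap(100, 10))   # 11
# 10 semaphore sets of 3 semaphores each.
print(suggested_semmap(10, 3))     # 5
```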

semmni

The maximum number of semaphore sets, systemwide; the number of semaphore identifiers. Every semaphore set in the system has a unique identifier and control structure. During initialization, the system allocates kernel memory for semmni control structures. Each control structure is 84 bytes, so you can calculate the result of making semmni large.

semmns

The maximum number of semaphores in the system. A semaphore set may have more than one semaphore associated with it, and each semaphore has a sem structure. During initialization the system allocates semmns * sizeof(struct sem) out of kernel memory. Each sem structure is only 16 bytes. You should set semmns to semmni times semmsl.
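Putting the semmni and semmns figures together gives a rough estimate of the kernel memory committed at initialization. The Python sketch below uses the 84-byte and 16-byte sizes quoted above; the function name and sample values are illustrative, and undo structures and the semaphore map are deliberately left out.

```python
SEM_CONTROL_BYTES = 84  # per semaphore set control structure
SEM_BYTES = 16          # per struct sem


def semaphore_kernel_bytes(semmni, semmsl):
    """Return (semmns, bytes) using the rule semmns = semmni * semmsl."""
    semmns = semmni * semmsl
    total = semmni * SEM_CONTROL_BYTES + semmns * SEM_BYTES
    return semmns, total


# 100 semaphore sets with up to 25 semaphores in each.
semmns, total = semaphore_kernel_bytes(semmni=100, semmsl=25)
print(f"semmns={semmns}, about {total} bytes of kernel memory")
```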

semmnu

The systemwide maximum number of undo structures. It seems intuitive to make this equal to semmni; doing so would provide for an undo structure for every semaphore set. Semaphore operations performed via semop(2) can be undone if the process should terminate for whatever reason. An undo structure is required to guarantee this operation.

semmsl

The maximum number of semaphores, per unique identifier. As mentioned previously, each semaphore set may have one or more semaphores associated with it. This tunable defines what the maximum number is per set.

semopm

The maximum number of semaphore operations that can be performed per semop(2) call.

semume

The maximum number of undo structures per process. See if the application sets the SEM_UNDO flag when it gets a semaphore. If not, you don’t need undo structures. semume should be less than semmnu but sufficient for the application. Set it equal to semopm times the average number of processes that will be doing semaphore operations at any time.

semusz

Although listed as the size in bytes of the undo structure, in reality semusz is the number of bytes required for the maximum configured per-process undo structures. During initialization, it gets set to semume * (1 + sizeof(undo)), so setting it in /etc/system is pointless. It should be removed as a tunable.

semvmx

The maximum value of a semaphore. Because of the interaction with undo structures (and semaem), this tunable should not exceed its default value of 32767 unless you can guarantee that SEM_UNDO is never being used.

semaem

The maximum adjust-on-exit value. A signed short, because semaphore operations can increase or decrease the value of a semaphore, even though the actual value of a semaphore can never be negative. semaem needs to represent the range of changes possible in a single semaphore operation, which limits semvmx, as described above. We should not be tweaking either semvmx or semaem unless we really understand how the applications will be using semaphores. And even then, leave semvmx as the default.

Message Queues

System V message queues provide a standardized message-passing system that is used by some large commercial applications.

msgmap

The number of entries in the msg map. See semmap above. The same implementation of preallocation of kernel space by use of resource maps applies.

msgmax

The maximum size of a message.

msgmnb

The maximum number of bytes on the message queue.

msgmni

The number of message queue identifiers.

msgssz

The message segment size.

msgtql

The number of system message headers.

msgseg

The number of message segments.