8.2. Memory Organization

The way in which both virtual and physical memory is organized can affect the overall performance of the system. Some of the hardware organizations of memory were discussed in Section 6.5, “Main Memory Performance Issues” on page 142. This section discusses how HP-UX organizes virtual memory into different page sizes to make better use of the processor's TLBs and how it chooses to allocate memory on cell-based machines to improve memory latency and overall system throughput.

8.2.1. Variable Page Sizes

PA-RISC 2.0 and IPF-based systems typically have smaller TLBs than early PA-RISC systems, which had very large off-chip TLBs. The small TLBs are a necessity for newer processors, since the CPU chip is a very expensive component, CPU chip area is a scarce resource, and TLB access time can limit the processor's frequency. Compared to large TLBs, small TLBs consume less space on the CPU chip and can be accessed quickly. A small TLB, however, means that fewer virtual-to-physical address mappings can be held at one time, which could increase TLB misses.

To complicate matters, PA-RISC 2.0 chips don't have a block TLB and also don't have a hardware TLB walker, so TLB misses are especially costly on these chips. To offset the loss in TLB entries for PA-RISC 2.0 and IPF processors, each entry was allowed to reference different-sized pages (see Table 8-2), whereas in earlier designs, only a single fixed page size could be used. Being able to map different-sized pages is referred to as variable page size mapping.

In HP-UX revisions 10.20 and later, the operating system takes advantage of the variable pagesize feature when creating memory objects. Using variable pages allows more efficient use of the processor's TLB and thus better performance due to reduced TLB misses. In addition, the use of variable-sized pages allows the operating system to represent memory more compactly. Table 8-2 shows the pagesizes allowed for various PA-RISC and IPF processors. Given the very large mappings (up to 4 GB on Itanium® 2-based systems), even a small number of TLB entries can map huge amounts of virtual address space. HP-UX supports all of the processor pagesizes except for the 8 KB pagesize on the IPF processors.

Table 8-2. TLB Pagesizes Supported on PA-RISC and IPF Processors
  Processor                    Pagesizes
  PA-8000, PA-8200, PA-8500    4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB
  PA-8600, PA-8700             4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB, 256 MB, 1 GB
  Itanium®                     4 KB, 8 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB, 256 MB
  Itanium® 2                   4 KB, 8 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB, 256 MB, 1 GB, 4 GB

The ability of HP-UX to handle mapping variable-sized pages has evolved over the years. Starting with HP-UX 10.20, variable-sized pages were only used for shared memory that was locked into main memory, or for text regions that were locked into main memory. These restrictions were applied due to the complexities in modifying HP-UX to handle paging variable-sized pages from memory to disk. Most databases, however, had the ability to lock both their shared memory segments and text segments, so large performance gains could be achieved in HP-UX 10.20 even with this limited model.

In HP-UX 11.00, the variable page implementation became much more elaborate. Variable-sized pages can now be used for almost any area of memory and there are no restrictions that require pages to be locked into memory (although locking pages in memory results in HP-UX using the largest pages available on a given system, which may provide further performance advantage over default HP-UX variable page size settings). Variable-sized pages are used in the kernel, for the process's text regions, data regions, stacks, u_area, shared memory, and anonymous memory-mapped regions. Only for HP-UX 11i v1.6 and later are lazy swap memory objects able to use variable-sized pages. Variable pages are not used for the buffer cache or for regular memory-mapped files due to complexities involving remapping I/O. HP-UX has the most advanced variable page implementation of all commercial UNIX™ operating systems.

In general, variable-sized pages only help improve performance, given the large time penalty that TLB misses can entail. However, variable-sized pages are not without a disadvantage. Because physical memory is allocated and assigned to a process in pagesize units, overuse of variable-sized pages can result in internal memory fragmentation within a page. Internal fragmentation can happen when the operating system allocates a larger page, but the application only references part of the page, whether it is code or data. Had a smaller page been used, the application may have more fully utilized the entire page. This internal page fragmentation can result in under-utilization of memory, and thus more physical memory is needed to perform a specific task. This problem can be more pronounced for systems that create thousands of processes. For systems running a single large process (such as engineering and scientific applications), internal memory fragmentation is usually less of an issue.

In addition to internal fragmentation, there can also be external memory fragmentation caused by many processes allocating many small pages. When many smaller pages are allocated, the free memory can become fragmented, leaving few large contiguous memory regions from which larger pages can be allocated.

HP-UX provides several ways to adjust how it uses variable-sized pages. By default, HP-UX will only use pages up to 16 KB in size. The default behavior can be modified globally via kernel-tuneable parameters and on a per-program basis via the chatr(1) command.

8.2.1.1. HP-UX Tuneables To Control Pagesizes
  • The vps_ceiling tuneable parameter can be set to adjust the maximum pagesize used, anywhere from 4 KB to the maximum supported pagesize for the given platform (see Table 8-2). The default for this value is 16 KB.

  • The vps_pagesize tuneable parameter can be used to set the minimum pagesize that HP-UX will use. The default for this tuneable is 4 KB. HP-UX will choose pages in the range vps_pagesize to vps_ceiling based on the memory object size.

  • The default value of vps_ceiling is set conservatively to minimize memory fragmentation. However, with large memory configurations (more than 1 GB of memory), setting this value to 1024 (1 MB) or larger may result in a significant speedup of many applications while not greatly increasing memory usage (see the sketch that follows this list).

  • In general, the vps_pagesize tuneable should probably not be modified. Setting the vps_pagesize parameter larger than its default of 4 KB may result in significant internal page fragmentation. For instance, if vps_pagesize were set to 256 KB, but only 64 KB of data needed to be allocated, the operating system would still use a 256 KB page to allocate the 64 KB of data, wasting 192 KB of memory. Even setting vps_pagesize to something like 16 KB can result in wasted memory due to the page fragmentation issue. However, for systems with multiple gigabytes of memory and lots of processes, setting vps_pagesize to 16 KB may improve overall performance by potentially using fewer TLB entries, but it will most likely increase the amount of memory needed to run all of the applications. Unfortunately, experimentation is the only way to determine whether there is a significant performance increase with an acceptable memory usage increase.

  • Finally, the vps_chatr_ceiling tuneable parameter limits the maximum pagesize allowed by any individual program that may have specified a maximum pagesize via the chatr(1) command. This is a system resource management tuneable parameter that is intended to limit the ability of individual programs to set their own maximum pagesize above a specific value. In general, this tuneable should not be lowered from its default value of the maximum available pagesize on a given platform. The reason this value should not be lowered is that many important database and technical applications will specify large maximum pagesizes via the chatr(1) command, and lowering vps_chatr_ceiling would make these programs run slower.
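The following is a rough sketch of how these tuneables might be inspected and adjusted. It assumes an HP-UX 11i v2 or later system where kctune(1M) manages kernel tuneables (earlier releases used kmtune(1M) or sam(1M)), and the value shown is only an example. The first command displays the current setting of vps_ceiling, and the second raises it to 1024 KB (1 MB):

# kctune vps_ceiling
# kctune vps_ceiling=1024

The new value should then be weighed against the memory fragmentation considerations discussed above.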

8.2.1.2. Setting Pagesize With Chatr or the Linker

The maximum pagesize allowed for a specific executable can be modified via the chatr(1) command or via equivalent linker options. The maximum pagesize for both the text segment and data segment can be manipulated individually.

The +pi chatr(1) option (and linker option) sets the text segment maximum pagesize, while the +pd chatr(1) option (and linker option) sets the data segment maximum pagesize. The value used with the +pd option also controls the pagesize for the stack, shared memory segments, and memory-mapped regions that are directly created by the executable.

For applications that typically access gigabytes of data via shared memory, memory-mapped files, or directly via the heap, we recommend that you set the data segment's maximum pagesize to the value of “L,” which instructs the operating system to use up to the largest available pagesize supported on the given hardware. One caveat with using “L” is that, due to fragmentation issues, additional memory consumption may occur after something like a processor upgrade, if the newer processor supports a larger pagesize than the previous one.

For the text pagesize, typically a value between 1 MB and 16 MB works well, depending on the size of the executable's text segment. The chatr(1) size used for text should be the largest pagesize that is equal to or just smaller than the text size. For example, an executable with a 12 MB text segment could use the +pi 4 MB option to chatr(1) to set the maximum text pagesize to 4 MB. This would most likely result in the operating system choosing to map the text segment using three 4 MB pages, which would be very efficient and waste little memory, given that the text segment is normally a shared memory object in HP-UX.
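As an illustration of these recommendations, the following example sets the text maximum pagesize to 4 MB and the data maximum pagesize to “L” for a hypothetical executable named my_app (the exact option spellings should be confirmed against chatr(1) on the target release); running chatr with no options afterwards displays the resulting settings:

# chatr +pi 4M +pd L my_app
# chatr my_app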

Depending on what is being allocated and the process's access pattern, the operating system will choose pagesizes up to the limits allowed by the system and chatr(1) tuneables. The chatr(1) tuneables take precedence over the system tuneable vps_ceiling, so an individual executable can obtain pages larger than vps_ceiling if its segments are chatr'ed to a pagesize larger than vps_ceiling. Maximum chatr(1) pagesizes are limited, however, by the system tuneable vps_chatr_ceiling.

8.2.1.3. General Pagesize Information

The size of the memory allocated has a large effect on the number and size of pages allocated by the operating system. For example, if a 1 MB shared memory segment is created, the operating system will attempt to allocate a single 1 MB page for this segment if the values of the tuneable parameters allow. However, if a 1020 KB (4 KB less than 1 MB) segment is allocated instead, then the operating system would probably allocate three 256 KB pages, three 64 KB pages, three 16 KB pages, and three 4 KB pages, for a total of twelve pages. So, rounding memory allocations up to a multiple of a larger pagesize can help reduce the total number of pages that the operating system must allocate, but it will create some amount of internal fragmentation.

For dynamic memory allocation, such as from the heap or stack, HP-UX uses algorithms to gradually increase the pagesize used based on the amount of activity seen in the recent past, as well as on the size requested. When there is severe memory pressure, the memory management system may be forced to use smaller pages due to a lack of free large pages. The memory management system will also tend to page out smaller pages rather than larger pages to make room for new memory requests, given that larger pages tend to be referenced more often than smaller pages because of the larger amount of data they contain.

Internal memory fragmentation for a particular executable can be controlled via the chatr(1) command by explicitly setting a given application's maximum pagesize to something smaller than the system default. For example, setting the +pi and +pd options both to 4 KB will cause an application to use only 4 KB pages, which will minimize memory usage but probably cause severe TLB pressure. System memory fragmentation in general can be controlled via the vps_pagesize and vps_ceiling parameters; keeping these at the system defaults will help reduce memory fragmentation. In general, however, systems are now configured with larger amounts of physical memory, and the price of that additional memory is usually worth paying compared to the cost of taking TLB misses and significantly slowing the performance of the application. So, using large pagesizes and potentially increasing the cost of the system due to the need for additional memory is usually not a bad trade-off.

8.2.2. Cell Local Memory

As described in Section 6.7.11, “Superdome” on page 160, several HP-UX servers (such as Superdome) are configured using a cell architecture. These cell-based systems distribute memory among all of the cells in the system, rather than locating memory in one central location. When a processor must access memory residing in a remote cell, the latency to access the memory is longer than when accessing memory located in the local cell.

Before HP-UX 11i v2 (11.23), all of the memory on these systems was cache-line-interleaved among the cells. Cache line interleaving places the first cache line of a memory page on one cell, the second cache line on the next cell, and so on until all cells contain one cache line, and then the process repeats. This interleaving has the effect of scattering memory accesses among all cells. This scattering makes the overall memory latency look uniform even though accessing memory on distant cells takes longer than accessing memory on a local cell.

In the global sense, for general purpose systems, it is a good design to scatter the cache lines across the cells, because processes may migrate from processor to processor and cell to cell for load balancing purposes. This has the effect of evening out the time spent in accessing memory among all of the processes on the system. For those system administrators who cannot or do not wish to tune memory accesses by processes, this is probably a good compromise.

With the release of HP-UX 11i v2 and the IPF versions of the cell-based servers, the concept of cell local memory (CLM) was introduced. All of the processors, memory, and I/O residing on a given cell constitute a locality domain. The firmware on the IPF cell-based servers was modified to allow part of the memory to be interleaved between cells, as before, and part of the memory to stay local to each cell. The HP-UX virtual memory system was then modified in HP-UX 11i v2 to allocate memory from either the interleaved pool or the cell local pool, depending on what type of memory needed to be allocated, or based on parameters passed to various system calls.

8.2.2.1. Performance Aspects of CLM

Using CLM can result in significant performance improvements for two reasons. First, accessing CLM can typically be up to twice as fast as accessing the farthest remote memory. If memory requests can be serviced faster, then the CPU waits less on memory misses and can be utilized more efficiently. Second, the aggregate memory bandwidth within the individual cells is typically much higher than the overall system bandwidth achievable when communicating between cells. This means memory-intensive applications can simultaneously access CLM at a much higher rate than if they all accessed memory interleaved among the cells, which typically leads to much higher scalability for memory-intensive applications on systems with large numbers of cells. The increased bandwidth and decreased latency associated with CLM can result in large performance gains (up to 2-3 times), depending on the application. In particular, applications that access a lot of memory benefit the most from CLM because these applications tend to have more cache misses, given the huge amounts of data being accessed. Many technical applications benefit from CLM, as do database applications such as Oracle.

Using CLM, however, can also negatively affect system performance if it is not used properly. For instance, if all memory for a process were allocated locally from a particular cell and the process migrated to another cell, the resulting memory accesses would always be remote, and the overall memory latency could be worse than the default of cell-interleaved memory. In addition, if some piece of memory is shared among multiple processes residing on multiple cells, but the memory was allocated local to a single cell, a bottleneck can form on the cell containing the memory, as all misses for that piece of memory must be serviced by a single cell. HP-UX is not able to help in these situations because it currently, and purposely, does not migrate memory pages between cells once the memory is allocated, so CLM will mostly reside in one cell for the life of a process. One exception, however, is that if a piece of memory is paged out due to memory pressure, it may be paged back in on a different cell.

8.2.2.2. Configuring CLM

Currently, CLM is not enabled by default in HP-UX. This is a conservative default, since not all applications may see a benefit, and choosing too much cell local memory may actually result in worse performance than using all interleaved memory (the default), considering the negative performance aspects described above.

To specify how memory should be distributed between cell-interleaved memory and cell local memory, the parmgr utility, or the parcreate(1) and parmodify(1) commands, can be used. A good starting point is probably to make 20% of the memory in each cell cell local. Depending on the application, one may want more or less than this amount. Highly parallel technical applications may benefit from more cell local memory because they tend to have large amounts of private data with high numbers of cache misses. Applications that make extensive use of shared memory may benefit less from cell local memory because shared memory is allocated from the interleaved pool by default.
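As a sketch only, the following shows how a cell's CLM amount might be requested with parmodify(1); the cell-specification syntax (cell:cell_type:use_on_next_boot:failure_usage:clm) and the field values shown are assumptions that should be verified against parmodify(1) for the given release. Here, 25% of cell 0's memory in partition 0 is requested as cell local, and parstatus is used to review the result:

# parmodify -p 0 -m 0:base:y:ri:25%
# parstatus -V -p 0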

Note that not all memory can be made cell local. There needs to be some memory that is interleaved on the system for use by the operating system. However, not all cells need to contribute interleaved memory, so a few cells could provide interleaved memory while others use only cell local memory. There is one performance aspect, however, of allocating too few cells to the interleaved pool: most of the kernel memory is allocated from the interleaved pool, so having too few cells allocated to interleaved memory can result in memory bottlenecks for applications that spend a significant amount of time in the operating system, such as databases.

8.2.2.3. Using CLM

Once some amount of CLM has been configured, the operating system will start using it. By default, process private data, such as the stack and heap, will always be allocated from cell local memory, while shared data, such as text or shared memory, will be allocated from interleaved memory. Cell local memory is used for process private data because typically this data is not shared among other processes. A multi-threaded application, however, will share the heap memory, and therefore a mechanism is provided to override the default cell local memory placement. For shared data, the interleaved pool works better to prevent a single cell from being saturated with memory requests. If a single cell were used for shared data, processes sharing the data in other cells would send requests to the single cell, causing a system bottleneck.

Overriding Private Data Placement

For multi-threaded processes, the private data allocated from the heap is shared among all threads in a process. Under CLM, this can potentially result in very poor performance if the heap is allocated from a single cell, or even just a few cells. Given this problem, HP-UX has a way to specify that a given process should allocate its private heap data from the interleaved memory area instead of from CLM. The chatr(1) command can be used with the +id enable option to specify that a given process should use the interleaved pool for private heap memory. The default behavior is to use CLM for private heap memory. The allocation behavior is inherited across process forks but not across a process exec.
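For example, a multi-threaded server whose heap is heavily shared by threads running on several cells could be switched to interleaved heap placement as follows (the executable name mt_server is hypothetical; running chatr with no options afterwards displays the attribute):

# chatr +id enable mt_server
# chatr mt_server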

Overriding Shared Memory Placement

To override the default behavior of shared memory allocation, HP-UX has modified some memory allocation system calls to accept hints indicating where memory should be allocated. For instance, the shmget system call can be called with the IPC_MEM_LOCAL flag to create the shared memory segment from cell local memory, or it can be called with the IPC_MEM_FIRST_TOUCH option to allow the memory to be allocated in the cell where the first process accesses it. Additionally, the mmap call can use the MAP_MEM_LOCAL flag to specify that memory should be mapped as cell local, or it can use the MAP_MEM_FIRST_TOUCH flag to specify that the first process touching the memory should map it as cell local. Applications that are “cell aware” may benefit from creating a shared memory region per cell. This could greatly reduce memory access latency if the application could partition large pieces of its memory such that processes and threads local to a particular cell used data from the shared memory segment on the given cell.
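The following minimal C sketch shows both placement hints named above. The flags come from the text; the segment sizes, error handling, and overall structure are illustrative only, and the availability of MAP_ANONYMOUS on the target release is assumed:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    /* Shared memory segment allocated from the creating cell's local memory. */
    int shmid = shmget(IPC_PRIVATE, 64 * 1024 * 1024,
                       IPC_CREAT | IPC_MEM_LOCAL | 0600);
    if (shmid == -1)
        perror("shmget with IPC_MEM_LOCAL");

    /* Anonymous mapping whose pages are placed in the cell of the process
       (or thread) that touches them first. */
    void *buf = mmap(NULL, 16 * 1024 * 1024, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE | MAP_MEM_FIRST_TOUCH, -1, 0);
    if (buf == MAP_FAILED)
        perror("mmap with MAP_MEM_FIRST_TOUCH");

    return 0;
}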

CLM Launch Policies

Given that HP-UX does not currently migrate cell local pages once they have been allocated in a locality domain, initial launch placement of processes is usually critical to achieving good performance with CLM. The mpsched(1) command can be used to specify where processes launch after being forked from a parent process. The launch policy is important because the process private data and stack memory, which are allocated from CLM, may be allocated shortly after a process is launched, depending on the application. Therefore, having a good initial placement of processes is important for evenly allocating local memory among the cells. Table 8-3 describes the per-process launch policies supported. The launch policies are inherited by all descendents created by a process. When using the mpsched(1) command, the resulting processes will stay bound to the initial locality domain where they were launched and will only be allowed to run on processors within that locality domain. This is done because HP-UX does not support page migration between locality domains; if a process were moved to another locality domain, excessive cross-cell traffic would result as the process accessed its cell local memory residing on its launch locality domain.

Table 8-3. mpsched Process Launch Policies
  Policy      Description
  RR          Round Robin. Under this policy, direct child processes are launched one per locality domain in a round robin manner. Once each locality domain has launched a process, the initial domain is chosen again.
  LL          Least Loaded. Under this policy, processes are launched on the locality domain that is least loaded with other active processes. This is a non-deterministic algorithm, given that it is based on the load of the system, which may change dynamically at any time.
  FILL        Fill a domain. Under this policy, direct child processes are launched on every processor in a locality domain before moving to the next locality domain. Once all domains have been filled, the initial domain is selected again.
  RR_TREE     Similar to RR, but all descendents of a process participate in the RR algorithm rather than just the direct children. The behavior of this algorithm is non-deterministic, given that processes launched from the children of the initiating process may launch in a different order from one invocation to the next.
  FILL_TREE   Similar to FILL, but all descendents of a process participate in the FILL algorithm rather than just the direct children. The behavior of this algorithm is non-deterministic, given that processes launched from the children of the initiating process may launch in a different order from one invocation to the next.
  PACKED      Pack a domain. Under this policy, all processes forked from a process are launched in the same locality domain as the parent process.
  NONE        The default HP-UX launch policy is used. (Note: the default can change from release to release.)

The following is an example of how to launch a process named “clm_test” that will create four worker processes using the mpsched(1) round robin launch policy on a system with four locality domains:

# mpsched -P RR clm_test
Pid 1234: bound to locality domain 0 using the round-robin process
launch policy

Performing the mpsched -q command shows the resulting bindings for the processes created:

# mpsched -q
Pid 1234: bound to locality domain 0 using the round-robin process
launch policy
Pid 1235: bound to locality domain 1 using the round-robin process
launch policy
Pid 1236: bound to locality domain 2 using the round-robin process
launch policy
Pid 1237: bound to locality domain 3 using the round-robin process
launch policy
Pid 1238: bound to locality domain 0 using the round-robin process
launch policy

Notice that each child process (processes 1235-1238) was placed on its own locality domain, allowing it to allocate private memory and use CPUs associated with that domain.

The launch policy one chooses will usually depend on the workload. Most workloads, however, work well with the round robin (RR) policy, so this should be used as a first attempt, with FILL as a second attempt. Note that the RR and FILL policies are preferred for typical applications, while the RR_TREE and FILL_TREE policies are intended for special situations where the application may benefit from a whole-tree launch policy.

For complete control over launch policies in an application, HP-UX also provides all of these launch policies via the mpctl(2) system call. See the mpctl(2) man page for specific usage. By using the mpctl(2) system call, a program can have complete control over how subprocesses (and threads) are launched. Some could potentially use round robin, while others could use FILL or PACKED. The behavior of the child processes and threads would dictate which policies to choose.
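As a hedged sketch of the mpctl(2) interface, the fragment below queries the locality domain topology and binds the calling process to one domain; the request names (MPC_GETNUMLDOMS, MPC_GETFIRSTLDOM, MPC_SETLDOM) and the ldom_t type are recalled from the mpctl(2) man page and should be verified there, as should the launch-policy requests themselves, which are not shown here:

#include <sys/mpctl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Number of locality domains (cells) visible to this partition. */
    int nldoms = mpctl(MPC_GETNUMLDOMS, 0, 0);

    /* Bind this process to the first locality domain so that its private
       data is allocated from that cell's local memory. */
    ldom_t first = mpctl(MPC_GETFIRSTLDOM, 0, 0);
    if (mpctl(MPC_SETLDOM, first, getpid()) == -1)
        perror("mpctl(MPC_SETLDOM)");

    printf("%d locality domains; bound to ldom %d\n", nldoms, (int)first);
    return 0;
}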

Finally, processor sets (see Section 7.3.3, “Processor Sets” on page 188) can be used to configure processors such that all of the processors for a given cell reside in the same processor set. Then any processes allocated to the processor set would be cell local without needing to use mpsched(1).
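For instance, assuming a cell whose processors are numbered 4 through 7, a processor set covering that cell could be created and used as follows; the processor numbers are hypothetical, psrset -c prints the identifier of the newly created set (assumed here to be 1), and psrset -e runs a command within that set:

# psrset -c 4 5 6 7
# psrset -e 1 clm_test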
