Performance Analysis Tools for Linux Developers: Part 1


Performance analysis and profiling for Intel Processor Architectures

By Mark Gray and Julien Carreno
October 20, 2009
URL:http://www.ddj.com/open-source/220700195

Mark Gray is a software development engineer working at Intel on real-time embedded systems for telephony. Julien Carreno is a software architect and senior software developer specializing in embedded real-time applications on Linux.


With the advent of the Intel Atom processor and multicore processors, Intel architecture processors are proliferating in a number of new market segments, most notably embedded systems, where good performance is essential. In parallel with this trend, Linux is becoming an established operating system option for embedded designs. The two trends combined pose an interesting question: "How do I get the most out of my embedded application running on an Intel platform and a general-purpose operating system?" During all kinds of application development, there comes a time when some level of performance analysis and profiling is required, either to fix an issue or to improve on current performance. Whether the concern is memory usage and leaks, CPU usage, or optimal cache usage, analysis and profiling would be almost impossible without the right tool set. This article seeks to help developers understand the more common tools available and select the most appropriate tools for their specific performance analysis needs.

In Part 1 of this article, we summarize some of the performance tools available to Linux developers on Intel architecture. In Part 2, we cover a set of standard performance profiling and analysis goals and scenarios that demonstrate what tool or combination of tools to select for each scenario. In some scenarios, the depth of analysis is also a determining factor in selecting the tool required. As the investigation goes deeper, we need to change tools to get the increased level of detail and focus. This is similar to using a microscope with different magnification lenses: we start from the smallest magnification and gradually increase it as we focus on a specific area.

top and ps

The top and ps commands are freely available on all Linux distributions and are generally installed by default. The ps command provides an instantaneous snapshot of system activity on a per-thread basis, whereas the top command provides mostly the same information as ps, updated at defined intervals, which can be as small as hundredths of a second. They are frequently overlooked as tools for understanding process performance at a system level. For example, most users tend to use the ps -ef command only to check which processes are currently executing. However, ps can also print useful information such as the resident set size or the number of page faults for a process. A thorough examination of the ps man pages reveals these options. Likewise, top can display all this information in various formats while updating it in real time. The top command window also displays summary information at the top of the window on a per-CPU basis.

In Figure 1, top is showing information for all threads of a process on a multicore machine. Using this more detailed view, we can see total CPU activity, all threads of the process "app", and the CPU on which each thread is scheduled at that instant (P). You can also see memory usage for the process, including resident set size (RES) and total virtual memory use (VIRT).

Figure 1: top View (Idle System)

In Figure 2, we can see similar information using ps. We can see the CPU usage on a per-thread basis to a tenth of a percent. This is the cumulative CPU percentage since the thread was spawned.

Figure 2: ps View (Idle System)

As can be seen, top and ps provide a good general overview of system performance and the performance of each process running on the system.
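
The resident set size and page-fault counts that ps can print are also exposed by the kernel in /proc/<pid>/stat. The following is a minimal Python sketch, not a description of how ps itself is implemented, that pulls those fields out of one such line. The function name parse_proc_stat and the sample line are made up for illustration; the field positions follow the proc(5) man page.

```python
def parse_proc_stat(line):
    """Extract a few fields from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition(" (")
    rest = tail.split()
    # rest[0] is field 3 (state), so field N of proc(5) is rest[N - 3].
    return {
        "pid": int(pid),
        "comm": comm,
        "minflt": int(rest[10 - 3]),     # minor page faults
        "majflt": int(rest[12 - 3]),     # major page faults
        "rss_pages": int(rest[24 - 3]),  # resident set size, in pages
    }


# Hypothetical sample line, cut off after the rss field.
SAMPLE_STAT = (
    "1234 (my app) S 1 1234 1234 0 -1 4194304 150 0 3 0 "
    "10 5 0 0 20 0 4 0 100 10485760 256"
)

info = parse_proc_stat(SAMPLE_STAT)
print(info)
```

On a Linux system, passing the contents of /proc/self/stat to the function reports on the Python process itself. The rss_pages value has to be multiplied by the page size (os.sysconf("SC_PAGE_SIZE")) to obtain bytes.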

free

The free application is freely available on all Linux distributions and is generally installed by default. Similar information can be found using top or sar, but free is a convenient command for viewing a snapshot of system memory usage, and it can be used to identify memory leaks (the allocation of memory blocks without ever freeing them) or disk thrashing due to excessive swapping.

Figure 3: free View (Idle System)
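
The figures that free prints come from /proc/meminfo. As a rough sketch, assuming the usual "Key: value kB" layout of that file, the Python below parses such text and derives the amount of swap in use, which is the number to watch when checking for thrashing. The helper names and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip anything that is not "Key: value [unit]"
        values[key.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


# Hypothetical sample; a real file has many more lines.
SAMPLE_MEMINFO = """MemTotal:        2048000 kB
MemFree:          512000 kB
SwapTotal:       1024000 kB
SwapFree:        1000000 kB
"""

meminfo = parse_meminfo(SAMPLE_MEMINFO)
print("swap used (kB):", swap_used_kb(meminfo))
```

Sampling these values periodically and watching for free memory that keeps falling while swap use keeps rising is one simple way to spot a suspected leak before moving to a more detailed tool.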

oProfile

The oProfile utility is a system-wide profiler and performance monitoring tool for user space as well as kernel space (the kernel itself can be included in the profiling). The profiler introduces minimal overhead and as such can be seen as relatively unobtrusive. However, it does require that the code be compiled with the debugging (-g) flag. Although active since 2002 and stable on a majority of platforms, oProfile still describes itself as an alpha-quality open source tool. The tool is released under the GPL and is in fact included by default in kernels from 2.6 onward. The tool works by collecting data from various CPU counters via a kernel module, then exposing that information to user space via a pseudo file system, in the same way that ps collects data via the "/proc" file system.

Figure 4: opreport from oProfile

gprof

The GNU profiler, gprof, is an application-level profiler. The tool is open source, licensed under the GPL, and is available as standard on most Linux distributions. Compiling the code using gcc with the -pg flag instruments the code, producing an executable that measures the execution time of functions with hundredth-of-a-second accuracy and exports this information to a file. This file can then be parsed by the gprof application, giving a flat-profile representation of the performance data and a call graph.

Figure 5: gprof View

The profiler collects data at sampling intervals in the same way as many of the tools described in this article. Therefore, the timing figures may contain some statistical inaccuracy if a function's run time is close to the sampling interval. Running your application for long periods of time reduces this inaccuracy. As can be seen from the output in Figure 5, gprof can help locate hot spots at function granularity. However, it can also report this information at a finer, line-by-line granularity using the -l flag.
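
To make the effect of the sampling interval concrete, the sketch below applies the rule of thumb given in the gprof documentation: for a figure built from n samples, the expected error is roughly the square root of n sampling periods. The function name is hypothetical, and the 0.01-second default simply mirrors the hundredth-of-a-second resolution mentioned above.

```python
import math


def expected_error_seconds(run_time, sampling_period=0.01):
    """Rough expected error, in seconds, of a sampled run-time figure."""
    samples = run_time / sampling_period
    return math.sqrt(samples) * sampling_period


for run_time in (0.05, 1.0, 100.0):
    err = expected_error_seconds(run_time)
    share = 100 * err / run_time
    print(f"{run_time:7.2f} s  +/- {err:.3f} s  ({share:.1f}% of the figure)")
```

The relative error shrinks as the run time grows, which is why longer runs give more trustworthy profiles.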

As an unexpected side-benefit, gprof can suggest function and file orderings within your binary to improve performance.

valgrind

valgrind is an instrumentation framework that is used primarily for detecting memory-related errors and threading problems, but it is also extensible. It is an open source tool licensed under GPL2. The tool can detect errors such as memory leaks and incorrect freeing of memory. valgrind detects these errors automatically and dynamically as the code is executing. In some cases it can produce false positives.

However, the developers of valgrind claim that it produces correct results 99% of the time and that any false errors can be suppressed. Although it is a very useful tool, it can be extremely intrusive: the code runs much slower than its true execution speed (by a factor of 50 in some cases) and needs to be compiled with the gcc -g flag. It is also recommended that the code be compiled without optimization, using the gcc -O0 flag. An example of the execution of a small binary through valgrind can be seen in Figure 6.

Figure 6: valgrind Example

Although it may be useful in some cases, for real-time applications that wait on I/O, valgrind can be so obtrusive as to make the checking unreliable. However, valgrind can be a highly useful tool when used in conjunction with a unit test and/or nightly build strategy. A clean run of valgrind in a nightly build allows the developer to keep track of any newly introduced latent memory errors.

Like many of the tools presented here, valgrind is not limited to the purpose that most developers have in mind. For example, valgrind can also check for cache misses and branch mispredictions. We strongly encourage you to read the relevant documentation and experiment with this and all the other tools to fully appreciate their power.

VTune

VTune from Intel is a proprietary system-level profiler and performance analysis tool for Intel architecture. It introduces minimal overhead and can therefore be perceived as relatively unobtrusive. VTune works by collecting data from various CPU counters via a kernel module. This information is collected when an interrupt is generated. The granularity of the data can run from process level down to instruction level, and the data is accessible through a highly usable and configurable GUI.

VTune, when fully configured for your application and operating system, can identify performance issues at several levels of granularity, from system level to microarchitecture level. As a tool for developers, it is extremely valuable since it has a global view at all granularities. OS performance counters can also be monitored and correlated to instruction-level hotspots. By using this correlation, we can answer questions such as "When the memory use in our system begins to ramp, what happens to our application's CPU usage?" If the source code in your test application is hooked into the VTune application, we can also drill down from the application level into threads and down to code functions.

It is impossible to outline all the features of VTune, and indeed of many of the tools described in this article; the interested reader is directed to the references.

Intel Thread Checker

The Intel Thread Checker is a plug-in for the VTune debugging environment. It can be used to locate hard-to-find threading errors such as race conditions and deadlocks.

sar

The system activity reporter (sar) is a lightweight open source tool licensed under the GPL that is used for collecting system-wide performance measures. The tool is generally installed by default on Linux; however, it may sometimes need to be installed using the sysstat package. Like top and ps, sar collects data from operating system counters via the proc file system. It provides performance data at system-level granularity, reporting on a wide variety of metrics such as CPU usage, disk I/O, memory, network I/O, and IRQs. The tool can update these values at intervals as short as 1 second.

sar can only provide information at system-level granularity and is used only to provide snapshots and overviews of overall system performance. Spurious or unexpected measurements from sar can be a first indication of performance issues in the system as a whole or in a single process or group of processes. It can be configured to run in the background, constantly providing a readily accessible database of system performance at any second during the day.

Figure 7: sar System-wide CPU Usage View

Figure 8: sar System-wide Memory Usage View
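
A system-wide CPU usage figure like the one sar reports can be derived from two readings of the aggregate "cpu" line in /proc/stat. The Python below is a simplified sketch of that calculation, not sar's actual implementation; the function names and the two sample lines are hypothetical, and the column order follows the proc(5) man page.

```python
def cpu_times(stat_line):
    """Split an aggregate 'cpu ...' line from /proc/stat into (busy, total) ticks."""
    fields = [int(v) for v in stat_line.split()[1:]]
    # Column order: user nice system idle iowait irq softirq steal ...
    # Guest time is already counted in user time, so only the first 8 are summed.
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields[:8])
    return total - idle, total


def busy_percent(before, after):
    """Percentage of CPU time spent non-idle between two readings."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return 100.0 * (busy1 - busy0) / elapsed if elapsed else 0.0


# Hypothetical readings taken some interval apart.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1100 0 550 8800 550 0 0 0 0 0"

print(f"CPU busy: {busy_percent(BEFORE, AFTER):.1f}%")
```

Because the counters are cumulative since boot, only the difference between two readings is meaningful, which is why such tools report values per interval.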

LTT

Linux Trace Toolkit (LTT) consists of a kernel patch and tool chain that give the user the ability to trace events on the system. These events can be kernel events (such as context switches and system calls) or any application-level event. It is GPL licensed and has minimal impact on the run-time performance of traced applications. It can be used to isolate performance problems on parallel and real-time systems and to analyze application timing. Any code that the user would like to analyze needs to be recompiled so that it is instrumented by LTT.

Alternatively, LTTng (Next Generation) is also available, which adds features such as a GUI Trace Viewer. See Figure 9.

Figure 9: Sample LTTng Viewer

iostat

The iostat command is used for monitoring the loading of system input/output block devices. With multiple block devices in the system, it can be useful to determine which device or devices are currently the bottleneck. iostat provides a per-device view of the number of transfers per second on each device as well as read and write rates. See Figure 10 for an example of the extended, device-only iostat output during a large file copy. Note the temporary increase in device activity while the file was being copied.

Figure 10: Sample iostat View (File Copy Example)
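
Per-device rates of this kind can be approximated from two snapshots of /proc/diskstats. The following Python sketch shows the idea under simplifying assumptions (it ignores the discard and flush columns that newer kernels append, and it is not iostat's own code). The function names and sample lines are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (reads, sectors_read, writes, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 10:
            continue
        # f[0], f[1] are major/minor numbers; f[2] is the device name.
        stats[f[2]] = (int(f[3]), int(f[5]), int(f[7]), int(f[9]))
    return stats


def device_rates(before, after, interval_s):
    """Transfers per second and kB/s per device between two snapshots."""
    rates = {}
    for name, (r1, sr1, w1, sw1) in after.items():
        if name not in before:
            continue
        r0, sr0, w0, sw0 = before[name]
        rates[name] = {
            "tps": ((r1 - r0) + (w1 - w0)) / interval_s,
            "read_kB_s": (sr1 - sr0) * SECTOR_BYTES / 1024 / interval_s,
            "write_kB_s": (sw1 - sw0) * SECTOR_BYTES / 1024 / interval_s,
        }
    return rates


# Hypothetical snapshots of one device, taken 5 seconds apart.
T0 = "   8       0 sda 1000 10 80000 500 2000 20 160000 900 0 700 1400"
T1 = "   8       0 sda 1100 10 88000 550 2500 25 200000 1000 0 800 1550"

rates = device_rates(parse_diskstats(T0), parse_diskstats(T1), 5)
print(rates["sda"])
```

Comparing these rates across devices over successive intervals is what makes a single overloaded disk stand out.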

iotop

iotop is a Python program with a top-like user interface that can be used to associate processes with I/O. It requires Python version 2.5 or greater and a Linux kernel version 2.6.20 or later with the TASK_DELAY_ACCT and TASK_IO_ACCOUNTING options enabled. Therefore, the kernel may need to be recompiled if these options have not been enabled by default. iotop is licensed under the GPL. It reports the amount of disk I/O occurring within the system on a per-process basis, which lets users determine which applications are using the disks the most.

Figure 11: Sample iotop View
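
When TASK_IO_ACCOUNTING is enabled, the kernel also exposes per-process I/O counters in /proc/<pid>/io. The Python sketch below does not reproduce iotop's own data collection; it only illustrates, with hypothetical function names and sample values, how those counters can be read to associate a process with disk traffic.

```python
def parse_proc_io(text):
    """Parse /proc/<pid>/io-style text into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            counters[key.strip()] = int(value)
    return counters


def disk_bytes(counters):
    """Bytes this process caused to be read from or written to storage."""
    return counters["read_bytes"] + counters["write_bytes"]


# Hypothetical sample for a process that has mostly been writing.
SAMPLE_IO = """rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""

io_counters = parse_proc_io(SAMPLE_IO)
print("disk bytes:", disk_bytes(io_counters))
```

The rchar and wchar counters include I/O satisfied from the page cache, so read_bytes and write_bytes are the ones to use when the question is which process is actually hitting the disk.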

Tools Summary

Alternative Tools