Managing Application Thread Use

Multicore processors are increasingly replacing single-core processors, and developers are being confronted with new challenges when using them.
By Levent Akyil
September 04, 2008
URL: http://www.ddj.com/high-performance-computing/210300581
Levent is a staff software engineer in Intel's Performance, Analysis, and Threading Lab. He can be contacted at levent.akyil@intel.com.
From high-end servers to desktops and laptops, multicore processors are increasingly replacing single-core processors. At the same time, however, application developers are confronted with new challenges when creating software that efficiently uses these multicore processors.
The specific challenge I examine in this article is the situation in which multiple CPU-bound multithreaded applications create "oversubscription" on a server and/or desktop system. The problem occurs when individual applications create as many threads as the number of cores available, and when each of these threads performs CPU-intensive tasks. Consequently, oversubscription occurs, negatively impacting overall performance. In other words, without knowing the actual system load and the other applications running on the system, the multithreaded applications compete with each other for system resources.
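To make the failure mode concrete, here is a minimal sketch (mine, not from the article) of the naive sizing pattern just described. Each process independently requests one thread per core, so two such CPU-bound processes on an 8-core machine put 16 runnable threads onto 8 cores:

#include <omp.h>

/* Naive per-application thread sizing: the process assumes it owns the
 * machine. Run two copies of this program at once on an 8-core system
 * and the scheduler must time-slice 16 CPU-bound threads across 8 cores,
 * which is the oversubscription described above. */
int main(void)
{
    omp_set_num_threads(omp_get_num_procs());  /* one thread per core */

    #pragma omp parallel
    {
        /* CPU-intensive work runs in every thread here */
    }
    return 0;
}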
Framework Details
The current generation of nonreal-time operating systems provides some type of fair resource (processor/core) allocation or scheduling; thus, overall performance suffers when multiple multithreaded applications execute in parallel. Of course, there have been proposals, such as static mapping (done before applications are started) and dynamic mapping (calculated at runtime), about how operating systems can provide some level of quality of service (QoS) and how application QoS values can be mapped to processor capacity.
The concept I propose here involves neither solving the oversubscription problem nor providing a QoS service. Instead, I propose a lightweight thread-usage balancing framework that can be implemented without any changes to operating systems or runtime environments. (That said, implementing this functionality in operating systems would make the mechanism available to all applications on demand.) "Thread usage balancing" in this context refers to adjusting the number of threads each application uses, based on the utilization of the cores/processors. During their runtime, applications only increase their thread count, and only up to the maximum allowed (normally the number of physical cores available on the system).

Figure 1: Block diagrams showing the framework (loads are the applications).
As Figure 1 illustrates, there are two fundamental parts to the dynamic load-balancing framework:
A stateless load governor running in the background, which monitors the utilization of the cores periodically. "Stateless" in this context means that the governor doesn't retain any information on how many threads each application uses.
The library that applications use to communicate with the governor and adjust their thread count accordingly. The communication takes place via sockets over TCP/IP. All applications set the number of threads to use for parallel regions (in the OpenMP sense) based on the feedback received from the governor, rather than setting the number of threads to the number of physically available processors, or to whatever value they wish.
Also, applications only increase their thread count, but can't set it higher than the physical number of cores. This makes the initial thread count with which an application starts worth noting.
The Load Governor
The load governor (Figure 2) uses a basic metric, processor idle time, to compute system utilization. Based on the idle time, the governor identifies available cores and returns the number of available cores to requesting applications. For example, if the idle threshold is set to 5, the governor marks as available all cores that are idle 5 percent or more of the time (that is, 95 percent or less utilized) at the time of the snapshot. The processor statistics (system time, idle time, and so on) are collected periodically from /proc/stat on Linux; this period can be adjusted as a parameter to the governor.
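As a rough illustration of this sampling step, the following C sketch (my own, not the author's governor code; the function names and the 64-core cap are invented for the example) reads the per-core counters from /proc/stat twice, one period apart, and counts the cores whose idle share of the interval meets the threshold:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_CPUS 64  /* arbitrary cap for this sketch */

/* Read per-core (idle, total) jiffy counters from /proc/stat. Per-core
 * lines look like: "cpu3 user nice system idle iowait irq softirq ..."
 * Returns the number of cores found. */
static int read_cpu_times(unsigned long long idle[], unsigned long long total[])
{
    FILE *fp = fopen("/proc/stat", "r");
    char line[256];
    int n = 0;

    if (!fp)
        return 0;
    while (fgets(line, sizeof line, fp) && n < MAX_CPUS) {
        /* skip the aggregate "cpu " line; keep only "cpu0", "cpu1", ... */
        if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
            continue;
        unsigned long long v[8] = {0};
        int cpu;
        if (sscanf(line + 3, "%d %llu %llu %llu %llu %llu %llu %llu %llu",
                   &cpu, &v[0], &v[1], &v[2], &v[3],
                   &v[4], &v[5], &v[6], &v[7]) >= 5) {
            idle[n]  = v[3];                      /* 4th field is idle */
            total[n] = v[0] + v[1] + v[2] + v[3] +
                       v[4] + v[5] + v[6] + v[7];
            n++;
        }
    }
    fclose(fp);
    return n;
}

/* Given two snapshots taken one period apart, count the cores whose
 * idle share of the interval is at or above the threshold (e.g., 5%). */
static int available_cores(const unsigned long long idle0[],
                           const unsigned long long total0[],
                           const unsigned long long idle1[],
                           const unsigned long long total1[],
                           int ncores, double threshold_pct)
{
    int avail = 0;
    for (int c = 0; c < ncores; c++) {
        unsigned long long dt = total1[c] - total0[c];
        double idle_pct = dt ? 100.0 * (idle1[c] - idle0[c]) / dt : 100.0;
        if (idle_pct >= threshold_pct)
            avail++;
    }
    return avail;
}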
The goal of the governor's algorithm is not to prevent oversubscription, but rather to allow applications to finish as soon as possible. The effect of the algorithm on the applications can be described as "slow start, fast finish," with the exception of the first multithreaded application that contacts the governor, because the first application is guaranteed to start with as many threads as there are cores. (The assumption here is that when the first application starts, the utilization of all cores is below the defined threshold.)

Figure 2: Block diagram showing the internals of the load governor.
The Thread Count Library
The second part of this framework is the library providing the API calls to communicate with the governor, and API calls to adjust the number of threads based on the number of available cores received in response. Applications increment their thread count by the number of available cores returned, with the ceiling being the number of physically available cores.
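A minimal sketch of what the client side of this library might look like is shown below. The article doesn't specify the governor's wire protocol, so the port number, the "AVAIL" request string, and the plain-text reply are assumptions made for illustration; only the increment-and-cap behavior comes from the text above.

/* Sketch of the client-side library. The governor is assumed to listen
 * on a local TCP port and answer a one-line request with the number of
 * available cores as plain text; GOVERNOR_PORT and "AVAIL" are invented
 * for this example. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define GOVERNOR_PORT 5150            /* hypothetical governor port */

static int current_threads = 1;       /* "slow start": begin small */

/* Ask the governor how many cores are currently available. */
static int query_available_cores(void)
{
    struct sockaddr_in addr;
    char buf[32] = {0};
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return 0;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(GOVERNOR_PORT);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        send(fd, "AVAIL\n", 6, 0) < 0 ||
        recv(fd, buf, sizeof buf - 1, 0) <= 0) {
        close(fd);
        return 0;                     /* on failure, keep current count */
    }
    close(fd);
    return atoi(buf);
}

/* Raise the thread count by the number of available cores returned,
 * capped at the number of physical cores; never decrease it. */
void set_thread_count(void)
{
    int max_threads = omp_get_num_procs();

    current_threads += query_available_cores();
    if (current_threads > max_threads)
        current_threads = max_threads;
    omp_set_num_threads(current_threads);
}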
The modifications to the original applications are minimal:
Include two header files: one for the thread count and governor communication, and one for the timing calculations. The timing header is not strictly necessary, but it is used during the simulation.
Link with the library.
Modify the code so that, just before entering a parallel region, the thread count is updated.
Also, the applications were instrumented to run many iterations rather than just one. Before entering a parallel region, each application sets its thread count, as in this sample OpenMP (www.openmp.org) code segment:
init_thread_count();          // initialize thread count
while (iter < ITERATION)
{
    set_thread_count();       // update thread count
    start_timer();
    // PARALLEL_REGION: original benchmark
    #pragma omp parallel for
    for (...)
    {
        WORK();
    }
    stop_timer();
    iter++;                   // advance the iteration counter
}
Methodology
To simulate multiple multithreaded applications running in parallel, I used benchmarks parallelized with OpenMP. I chose OpenMP because it is easy to prototype with and makes the proposed framework easy to implement. Some of the benchmarks were taken from the OpenMP source code repository (SCR) project (http://sourceforge.net/projects/ompscr) and others were written from scratch. All the benchmarks are CPU-bound and scale well.
The simulation is performed in two different scenarios:
In the uniform run, two or more instances of the same benchmark were executed.
In the mix run, one or more instances of all benchmarks were run simultaneously.
Again, the first benchmark started gets the maximum thread count, which is equal to the number of physically available processors.
The baseline measurement for both the uniform and mix scenarios is taken when every running benchmark sets its number of threads to the number of processors available on the system, rather than changing its thread count during runtime (no load governor is running). In other words, if n different benchmarks or n instances of any benchmark are executed simultaneously, there is a total of (n * the number of cores) threads.
Test Results
In this simulation, I used a 2.67-GHz Intel Core 2 quad-core processor-based DP server with a total of 16 GB of RAM, running Fedora Core 6. This platform provided 8 cores for the simulation. The load governor was set to run every 1 second to read the processor statistics from /proc/stat. Each benchmark communicated with the load governor every 10 seconds to update its thread count. The iteration count (ITERATION) was set to 100. The idle-time threshold was set to 5, 10, 15, and 20 during the simulation for each test setup. The benchmarks I used in the simulation were: Prime, which finds the prime numbers within a range; Mandel, OpenMP SCR's Mandelbrot implementation, which computes an estimate of the Mandelbrot set area using Monte Carlo sampling; MD, OpenMP SCR's molecular dynamics simulation; and Matvec, a matrix-vector multiplication. All benchmarks were compiled with the Intel C++ compiler, version 10.0.
Uniform Test Results
This scenario was simulated with two and four instances of each benchmark.
The pseudocode for the uniform simulation is given below. The number of instances in the simulation was set first to 2 and then to 4.
for threshold in 5 10 15 20
do
    for load in {prime, mandel, md, matvec}
    do
        start governor -thresh -method
        for i in # of instances
        do
            start load
            sleep
        done
        wait_for_all_loads_to_finish
        stop governor
    done
done
The uniform simulation results (Figure 3) show that regardless of the number of instances of the benchmarks (loads), the aggregate elapsed time can be reduced anywhere from 5 to 20 percent. These results also show that if the number of instances of a benchmark running in parallel is increased, the aggregate elapsed time can be decreased further. Each instance of any benchmark is capable of fully utilizing all cores on the system; therefore, the various threshold values used in the simulation didn't make a significant difference.

Figure 3: Uniform test results combined (4 instances of each benchmark; idle threshold 10).
Mix Run Test Results
In the real world, however, it is more common to have multiple different multithreaded applications running in parallel, competing for the limited number of cores available on the system, than to have several instances of the same application. As multithreaded applications become more widely available, this is increasingly an everyday scenario; the idea behind the mix run is to simulate this case. During the simulation, a single instance of each multithreaded benchmark was executed simultaneously.
The pseudocode for the mix simulation is given below. The number of instances here was set first to 1 and then to 2.
for threshold in 5 10 15 20
do
    start governor -thresh -sched
    for load in {prime, mandel, md, matvec}
    do
        for i in # of instances
        do
            start load
            sleep
        done
    done
    stop governor
done
Again, the first instance of each load gets the maximum thread count, which is equal to the number of physically available processors/cores. This ensures that the first benchmark to start completes as fast as possible while the others run more slowly.

Figure 4: Single instances of each benchmark are executed in parallel.
In Figure 4, the total speed-up (that is, the decrease in elapsed time of all benchmarks running simultaneously) is 1.23x. Only one of the four benchmarks (matvec) was negatively impacted in this simulation run. After analyzing all the simulation results, it became clear that with this framework, the execution time of each benchmark, and the order in which the benchmarks are started, make a difference in aggregate performance. The best improvement in aggregate elapsed time was achieved when the benchmark that takes the least time to complete was started first, and the second-fastest benchmark was started second. This can be explained in the following manner:
Let Li be any given benchmark (load), TLi the execution time of benchmark Li, and n the number of benchmarks. If the benchmarks' times to completion satisfy:

TL1 < TL2 < ... < TLn

then the best results are achieved when L1 is started first, then L2, and so on. However, it is not guaranteed that L2 will pick up the available cores when L1 completes; any benchmark can pick up the available cores, because the benchmarks communicate with the load governor independently, each on its own timer. In the best mix run, a total improvement of 23 percent in aggregate elapsed time was achieved.
But if, instead:

TLn > ... > TLi > TL(i-1) > ... > TL1
then starting Ln first gives the worst aggregate elapsed time. The explanation is that Ln takes all the available cores and yet still finishes last, so the other benchmarks start with a low number of threads and complete without ever having a chance to increase their thread count. This is the only condition under which a traditional (oversubscription) framework will outperform the proposed framework.
It was also noted that the more loaded the system, or the longer the applications ran (higher iteration count), the better the aggregate performance.
The dynamic nature of the framework can be analyzed with the Intel Thread Profiler. Figure 5 shows the Thread Profiler analysis of one of the benchmarks during the test run. One can easily see that the OpenMP worker threads were increased every time more processors became available.

Figure 5: Intel Thread Profiler analysis showing how one benchmark increases its OpenMP worker threads dynamically within the framework.
Conclusion
Many applications are becoming more demanding in how they use processor resources to take advantage of parallelism. However, these applications are not aware of the other applications running alongside them on the same system. Clearly, this lack of system-wide knowledge hinders not only the applications' performance but also overall system performance.
This framework showed that with a very basic metric and feedback mechanism, the overall performance of the applications can be significantly improved. While the described lightweight framework is easy to implement and take advantage of, some aspects of it need improvement. In particular, further study is needed on how to eliminate the dependency on the execution order of the applications. This could be addressed by letting the load governor keep some state about each application's thread usage and timing; applications could then decrease, rather than merely increase, the number of threads they use.