CLR Inside Out: Using concurrency for scalability

Joe Duffy
Code download available at: CLRInsideOut2006_09.exe (151 KB)
Contents
A Tale of Hardware Threads
Memory Hierarchies
Units of Work
Know the Costs
Defining Boundaries
How Many Tasks?
Shared State
Tools for Profiling Parallelism
Conclusion
There's been a lot of buzz lately about concurrency, due mostly to major hardware vendors' plans to add more processor cores to both client and server machines, and to the relatively unprepared state of today's software for such hardware. Many articles focus on how to make concurrency safe for your code, but they don't deal with how to get concurrency into your code in the first place.
Both tasks are important and can be difficult for different reasons. Randomly creating new threads and sprinkling calls to ThreadPool.QueueUserWorkItem all over your codebase isn't going to lead to good results. You'll need to take a more structured approach. First, let's take quick stock of the situation.
Over the course of the 90s, parallelism grew to become a silent enabler of software scalability in successive generations of processor architectures. While most of us didn't have to realize it was there, or write code differently to take advantage of it, parallelism was nevertheless at work. Instruction-level parallelism (ILP) techniques are layered underneath existing sequential programming models and carried out at the granularity of an individual instruction stream, employing branch prediction, speculation, and dataflow-driven out-of-order execution. Pipelining, for example, can lead to 25-30 percent performance increases, depending on the depth of the pipeline and the workload. Such techniques, when coupled with clock-speed increases, ensured that software continued to run faster with each generation of hardware while requiring minimal additional work from the software.
While chip vendors still expect Moore's Law to continue (doubling the number of transistors on a processor every 18 months or so), what engineers can do with those transistors has begun to shift. Increasing the clock speed of new chips at the rate seen in the past is simply not possible, primarily because of the heat generated. However, it is possible to use the additional transistors to put more low-latency memory and, if the software can take advantage of it, more cores on the chip. Note the qualification. Much software today assumes a single-threaded view of the world, and this needs to change if you expect to make use of those extra cores.
In some sense, a large chunk of the responsibility for making software go faster with the next generation of hardware has been handed off from hardware to software. That means in the medium-to-long term, if you want your code to get faster automatically, you'll have to start thinking about architecting and implementing things differently. This article explores these architectural issues from the bottom up and aims to guide you through this new world. In the long run, new programming models are likely to appear that will abstract away many of the challenges you will encounter.
A Tale of Hardware Threads
Symmetric multiprocessor (SMP) computers capable of running Windows® have been on the market for years, although typically found only in servers and high-end workstations. These machines contain one or more boards, each board typically containing multiple sockets, with each socket holding a complete CPU. Each CPU in this regard has its own on-die caches, interrupt controllers, volatile state (registers), and a processor core, including its own execution units. The Windows scheduler maps individual software threads to individual CPUs, which in this case are entirely independent. I'll call these CPUs single-threaded in the hardware thread sense. Because each unit is relatively isolated (aside from shared memory architecture, which I'll discuss shortly), you can improve execution throughput for each new CPU added to a machine if enough software threads are offered up for execution.
Intel introduced Hyper-Threading (HT) for the Pentium 4 processor series. HT packs an extra set of interrupt controllers and volatile state onto a single physical CPU in a single socket, enabling multiple software threads to execute in parallel on distinct logical processors, although they share the same set of execution units. This approach is similar to that taken earlier by supercomputer companies such as Tera. Due to latencies associated with accessing memory, among other things, the instructions on the two logical processors can frequently interleave, leading to a parallel speedup. In this sense, HT-enabled CPUs are two-threaded because the Windows scheduler can map two runnable threads to an HT processor simultaneously. In reality, HT is suitable for some workloads, and has been cited as leading to a 15-40 percent improvement on real-world programs.
Multi-core technology, which is already readily available for client and server machines alike, replicates the per-CPU architecture on a single chip, enabling a single socket to contain multiple full CPUs. Dual-core is available now (two cores on a chip), with 4-core, 8-core, and beyond on the horizon. Unlike HT, multi-core CPUs have individual execution units, and therefore are generally capable of more substantial parallel speedups. Much like individual CPUs, each core is logically distinct aside from shared memory architecture. This means it's possible that doubling the number of cores doubles throughput. In this sense, the number of cores is the number of threads that can be running simultaneously. Of course, these technologies are not mutually exclusive. A 4-socket, 4-core, HT-enabled computer amounts to 32 hardware threads. That's quite a bit of horsepower.
Memory Hierarchies
Memory interaction is often a substantial factor in the performance of modern software. Your typical computer contains a fairly complex memory system, consisting of multiple levels of cache between the processors and the actual DRAM banks. SMP machines traditionally have consistent hierarchical designs, although more exotic architectures are available and could become more prevalent with the increasing availability of massively parallel machines. One such exotic architecture is Non-Uniform Memory Access (NUMA), where multiple main memories are dedicated to nodes of multiple CPUs. Cross-node communication is possible, although extremely expensive. Various parts of Windows and the CLR change strategy to account for NUMA. Intelligent parallel code often has to do the same.
Cache-friendly concurrent software uses memory intelligently and efficiently, exploiting locality to reduce the total number of cycles required for a given computation. Locality comes in two major flavors. First is spatial locality: data close together in memory will be used close together in a program's operations. While bigger cache lines mean you may end up pulling more data than necessary into cache, a program with good spatial locality will take advantage of this by subsequently accessing other addresses that were already pulled into cache. The CLR garbage collector maximizes spatial locality by doing allocations contiguously, for example.
Temporal locality is the concept that memory stays in cache for a reason: if it was recently accessed, you can expect that it might be accessed again soon. Modern caches use eviction policies that optimize using pseudo-least-recently-used (LRU) techniques.
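To see spatial locality at work, consider this small sketch of mine (not from the code download): both loops sum the same rectangular array, but the first walks memory contiguously (C# rectangular arrays are stored row by row), while the second strides a whole row's worth of memory per step and tends to miss far more often on large arrays:
const int N = 2048;
int[,] grid = new int[N, N];

// Good spatial locality: the inner loop touches adjacent addresses,
// so each cache line pulled in is fully used before being evicted.
long rowMajorSum = 0;
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        rowMajorSum += grid[i, j];

// Poor spatial locality: the inner loop jumps N elements per step,
// touching roughly one element per cache line it pulls in.
long columnMajorSum = 0;
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        columnMajorSum += grid[i, j];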
Well-written parallel software can even observe super-linear speedups from keeping more data in cache and sharing less with other threads. That is, on an n-CPU machine, the software might run more than n times faster than on a single-CPU machine. On the other hand, the cost of a "cache miss" is fairly expensive. This is illustrated further by the relative comparison of cache-access costs in Figure 1. As with all rules of thumb, take the numbers with a grain of salt and pay closer attention to the orders of difference.
Figure 1  Logarithmic Graph of Comparative Access Costs
Parallel software in particular needs to take notice of locality. Two threads that continually update data on intersecting cache lines can cause cache line ping-pong, where the processors spend an extraordinarily high amount of time acquiring exclusive access to a cache line, which involves invalidating other processors' copies. Some cache line interactions are obvious, since there is real data sharing at the application level. Other interactions are less obvious and result from data residing close together in memory, which unfortunately can't be easily determined simply by examining the algorithm.
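Here's a contrived sketch of the less-obvious case and one common mitigation: two threads increment logically independent counters that happen to sit next to each other in an array. Spacing the counters a cache line apart (the 64-byte figure is an assumption of mine; line sizes vary by processor) trades a little memory for far less ping-pong:
class PaddedCounters
{
    // Assume 64-byte cache lines (16 ints); actual sizes vary by CPU.
    private const int Pad = 16;
    private readonly int[] counters = new int[2 * Pad];

    // Each worker increments only its own slot. Because slot 0 and
    // slot Pad land on different cache lines, the two threads no
    // longer invalidate each other's cached copies of the same line.
    public void Work(int threadIndex)
    {
        int slot = threadIndex * Pad;
        for (int i = 0; i < 10000000; i++)
            counters[slot]++;
    }
}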
Similarly, thread migration (which I'll discuss in more detail later) can cause a thread to move to another processor and subsequently have to acquire and invalidate all of the cache lines it once held on the original processor. This cache migration can cost on the order of 50 times the cost of an on-die cache hit for each line access requiring migration. On NUMA machines this can be disastrous due to the cost of inter-node communication, though the migration problem can be partially avoided through clever usage of processor affinity. These are costs to be aware of when writing highly parallel code. The new GetLogicalProcessorInformation API in Windows Vista™ enables you to inquire about the machine architecture, including cache layout and NUMA information, which could be used dynamically during scheduling, for example.
Units of Work
To get your software to execute in parallel, clearly you need to somehow break up the problems encoded in your algorithms into sub-problems: smaller units of work that I will refer to as tasks. A task takes some input and produces some output, whether that output is a piece of data or an action. It can execute in isolation, although it may have subtle dependencies on state or ordering that might not be evident at first.
Functions pretty much already do this, you might say. But unlike ordinary functions, which are defined statically as you write your code, to write software that scales given an arbitrary number of CPUs, the boundaries of a task must often be discovered dynamically, or at least the tasks must be presented to an intelligent architecture that knows whether it is profitable to execute a task in parallel. Furthermore, to make a task execute in parallel, your code has to somehow arrange for it to be called in such a manner rather than simply calling it sequentially on the current thread. On Windows, this typically means running on a separate OS thread; on the CLR, it typically means queueing the work to execute on the ThreadPool, as the sketch below shows.
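Here's a minimal sketch contrasting the two (the DoWork method and its argument are hypothetical, and the snippet assumes using System.Threading):
static void DoWork(object state)
{
    // ... the body of the task ...
}

static void RunSequentially()
{
    DoWork("input");                                // runs on the current thread
}

static void RunConcurrently()
{
    ThreadPool.QueueUserWorkItem(DoWork, "input");  // runs on a pool thread
}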
The physical unit of execution on Windows is a thread. Each process begins life with a single thread, but of course the code running in that process is free to introduce and later tear down additional threads as it sees fit. The Windows scheduler is responsible for assigning threads to hardware threads and allowing the code to execute. If there are more threads than there are hardware threads, the scheduler has to be a little more sophisticated; it picks the runnable thread of the highest priority (subject to clever anti-starvation algorithms), and lets it execute until a quantum has expired. Once the quantum expires, a context switch occurs and the same scheduling algorithm chooses the next thread to execute. The length of a quantum varies based on the OS type and configuration, but it will generally be around 20ms for client platforms and 120ms for server platforms. A thread can block as a result of performing I/O, attempting to acquire a contended lock, and so on. In this case, just as with a context switch, the scheduler will pick a new thread to execute.
As mentioned earlier, it's crucial to the performance of traditional SMP systems that as much data reside in cache as possible. "Data" in this sense refers to the code being executed, the heap data being manipulated by the thread's algorithms, and the thread's stack. As threads are switched into and out of CPUs, Windows automatically employs so-called ideal processor affinity in an attempt to maximize cache efficiency. For example, a thread running on CPU 1 that gets context switched out will prefer to run again on CPU 1 in the hope that some of its data will still reside in cache. But if CPU 1 is busy and CPU 2 is not, the thread could be scheduled on CPU 2 instead, with all the negative cache effects that implies.
Know the Costs
Threads aren't free. They incur CPU and memory costs, which you should keep in mind. If your goal is to use concurrency to increase the scalability of your algorithms, presumably you'll also be spending as much (if not more) time doing traditional performance profiling work. Running a sloppy algorithm in parallel will do nothing but make your sloppy algorithm use up more system resources. Ensuring that the most important hot sections of your code are as efficient as possible in the sequential case is crucial for taking advantage of parallel scaling.
To determine what costs you can afford, there are some general rules of thumb. The cost of creating a Windows thread is approximately 200,000 cycles, whereas the cost of destroying one is about 100,000 cycles. So right there you know that if your intention is to spin up a new thread to execute 100,000 cycles' worth of work, you're paying a hefty overhead, and, if I had to guess, you won't observe any type of speedup either.
The memory overhead varies based on configuration. But most managed threads will reserve 1MB of user stack space and will commit the entire amount, which means the memory must be backed physically either in real RAM or in the page file. There is also a small set of kernel stack pages required: three pages on 32-bit systems and six pages on 64-bit systems. An additional 10-20KB of virtual memory is used by other data structures, but this is dwarfed by the memory required by the stack. GUI threads are slightly more expensive still, because they must set up additional data structures such as the message queue.
Now, if you end up with too many threads of equal priority, you will have to context switch often. A context switch costs 2,000-8,000 cycles, depending on the system load and architecture, and involves saving the current volatile state, selecting the next thread to run, and restoring the next thread's volatile state. This may not sound like a lot, especially when compared to the duration of a quantum and the cost of subsequent cache misses, but it represents pure overhead that is taken away from executing application code.
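If you want to ground these rule-of-thumb numbers on your own hardware, a rough probe like the following (a sketch of my own, not from the code download; it assumes using System, System.Diagnostics, and System.Threading) measures the fixed cost of creating, starting, and joining threads that do no work:
static void MeasureThreadCost()
{
    const int Count = 200;
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < Count; i++)
    {
        Thread t = new Thread(delegate() { /* no work */ });
        t.Start();
        t.Join();
    }
    sw.Stop();
    Console.WriteLine("Average create/start/join: {0:F3} ms",
        sw.Elapsed.TotalMilliseconds / Count);
}
Expect the per-thread figure to vary widely by hardware, OS version, and load; the point is the order of magnitude, not the exact number.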
Given that you'd like to minimize the cost of introducing and destroying new OS threads, and the negative consequences of accidentally introducing "too much" work, you should consider using the CLR's thread pool. It hides clever thread injection and retirement algorithms beneath a simple interface, amortizing the cost of creating and destroying threads over the life of your program. And using the ThreadPool class is simple.
That said, even using the ThreadPool costs something. Invocations of QueueUserWorkItem incur a sequential cost to the caller, and the infrastructure that dispatches work from the pending work item queue also imposes an overhead on the parallel work being executed. For coarse-grained parallelism, these costs are so small that you're not likely to notice them. But for extremely fine-grained parallelism, these costs could become a significant scalability bottleneck. You might consider building your own lightweight thread pool out of lock-free data structures, avoiding some of the costs incurred by a general purpose ThreadPool, like ensuring fairness between AppDomains, capturing and restoring security information, and so forth. For most uses, however, the stock ThreadPool is up to the task.
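As one example of trimming per-item overhead without writing your own pool, the ThreadPool also exposes UnsafeQueueUserWorkItem, which skips flowing the caller's security context to the work item. A hedged sketch (assumes using System.Threading):
// The ordinary call captures and flows the caller's security context:
ThreadPool.QueueUserWorkItem(delegate(object state)
{
    // ... a reasonably coarse chunk of work ...
}, null);

// The unsafe variant skips that capture, shaving per-item overhead.
// Only use it when the work doesn't depend on the caller's context.
ThreadPool.UnsafeQueueUserWorkItem(delegate(object state)
{
    // ... fine-grained work ...
}, null);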
Defining Boundaries
Figuring out how to split up your work is not a negligible activity. When dealing with CPU-bound workloads, the job is focused more on avoiding the performance overhead associated with concurrent execution. But most workloads are not CPU-bound; they mix various forms of I/O and synchronization in with the CPU work, either of which can lead to unpredictable blocking patterns. Thus, with most code, concurrent execution is more about orchestrating complex coordination patterns than it is about lower-level performance.
Perhaps the simplest way to partition work is to use the server model of concurrency. In servers like SQL Server™ or ASP.NET, each incoming request is considered an independent task and consequently runs on its own thread. The host software often throttles the number of real OS threads used so as not to over-introduce concurrency. Most workloads like this are composed of totally independent tasks that access disjoint sets of data and resources, leading to highly efficient parallel speedups. For client programs, however, few workloads fit into this model cleanly. Fielding and responding to peer-to-peer communication, for example, can be done via this model, but unless a large number of work-intensive incoming requests are expected, the ceiling on potential speedups here is fairly limited.
An alternative is to carve out arbitrary sub-tasks in your code using a more logical and subjective definition of "significant task," which tends to be more conducive to client-side workloads. A complex software operation typically consists of multiple logical steps, perhaps represented in your program as independent function calls that themselves contain multiple steps, and so on. You might consider representing each function call as an independent task, at least those that are substantial enough. This is tricky in the sense that you must consider ordering dependencies, which adds substantial complexity to the idea. Most modern imperative programs are full of unstructured loops, accesses to data via opaque pointers that may not be close together in memory, and function calls, none of which document cleanly what dependencies exist. And of course there's hidden thread affinity that you might not know about. So this technique clearly requires a deep understanding of the problem your code is trying to solve, and some thought about the most efficient way to perform it in parallel, eliminating as many dependencies as possible.
A common related pattern is fork/join parallelism, where a master task forks multiple child tasks (which themselves can also fork child tasks), and each master task subsequently joins with its children at some well-defined point. As an example, consider a model of task-level parallelism called fork/join futures, which encapsulates this pattern based on function calls as the unit of task. This can be illustrated with some new type Future (an implementation of Future is available for download from the MSDN® Magazine Web site):
int a() { /* some work */ }
int b() { /* some work */ }

int c()
{
    Future<int> fa = a();
    Future<int> fb = b();
    // do some work
    return fa.Value + fb.Value;
}
The meaning of this code is that the invocations of a and b can execute in parallel with the body of c; the decision is left to the implementation of the Future engine. When c needs the results of those invocations, it accesses the Value property of the future. This has the effect of waiting for the work to complete or, if the work has not yet begun executing asynchronously, executing the function locally on the calling thread. This syntax maps closely to the existing IAsyncResult class, but has the added benefit of more intelligence about how much concurrency to introduce into the program. While more clever implementations are easy to imagine, a straightforward translation of this code might look like this:
int a() { /* some work */ }
int b() { /* some work */ }

delegate int Del();

int c()
{
    Del da = a;
    IAsyncResult fa = da.BeginInvoke(null, null);

    Del db = b;
    IAsyncResult fb = db.BeginInvoke(null, null);

    // do some work

    return da.EndInvoke(fa) + db.EndInvoke(fb);
}
Other approaches are possible, such as using longer-running sub-tasks rather than requiring that children never outlive their parents. This often requires more complex synchronization and rendezvous patterns. Fork/join is simple because the lifetime of individual workers is obvious.
The previous discussion takes a code-centric view of parallelism. Another technique is often simpler: data parallelism. This is usually appropriate for problems and data structures that are data- and compute-intensive, or whose individual operations tend to access disjoint data frequently.
One common data parallelism technique is called partitioning. Loop-based parallelism, for example, uses this approach to distribute computations over a range of elements. Say you had 16 logical CPUs, an array of 100,000 elements, and a piece of work that can execute with little-to-no dependency and tends to block 20 percent of the time. You could split the array into 20 contiguous chunks (you'll see how I calculate this number later) of 5,000 elements apiece, fork off 19 threads (reusing the current thread for one partition), and arrange for each thread to do its calculations in parallel. Parallel query processing in databases like SQL Server uses a similar approach. This technique is illustrated in Figure 2.
Figure 2  Partition-Based Parallelism
The example shows a 100,000-element array distributed over four threads. Notice that some amount of sequential overhead is paid for the split. In cases where a merge is necessary, additional cost is often paid to merge the results, including joining with the outstanding threads.
For-all loops have traditionally been used to express partition-based parallelism in programming languages. An example ForAll API implementation can be found in Figure 3. Similar approaches could also be used to parallelize loops; for example, rather than taking an IList<T>, you could instead take from and to int parameters and then feed the loop iteration number into an Action<int> delegate.
 Figure 3 Simplistic Parallel Implementation
static void ForAll<T>(IList<T> data, Action<T> a)
{
    ForAll(data, a, 0.0f);
}

static void ForAll<T>(IList<T> data, Action<T> a, float blocking)
{
    if (blocking < 0.0f || blocking >= 1.0f)
        throw new ArgumentOutOfRangeException("blocking");

    int partitions = Math.Min(data.Count, (int)Math.Max(1.0f,
        (float)Environment.ProcessorCount / (1 - blocking)));
    int perCount = (int)Math.Ceiling((float)data.Count / partitions);

    int done = partitions - 1;
    ManualResetEvent mre = null;

    if (partitions > 1)
    {
        mre = new ManualResetEvent(false);

        // Queue a work item per-partition. This is the "fork"
        // part of the work.
        for (int i = 0; i < partitions - 1; i++)
        {
            int idx = i;
            ThreadPool.QueueUserWorkItem(delegate
            {
                // Clamp to data.Count so the last partitions don't
                // walk off the end when perCount rounds up.
                for (int j = idx * perCount;
                     j < Math.Min((idx + 1) * perCount, data.Count);
                     j++)
                {
                    a(data[j]);
                }
                if (Interlocked.Decrement(ref done) == 0)
                {
                    mre.Set();
                }
            });
        }
    }

    // Execute one partition's worth of operations on the
    // calling (current) thread.
    for (int i = (partitions - 1) * perCount; i < data.Count; i++)
        a(data[i]);

    // If we queued async partitions, wait for them to finish.
    // This is the "join" part of the work.
    if (mre != null)
    {
        mre.WaitOne();
        mre.Close();
    }
}
This code makes one major assumption that could be disastrous if violated: the Action<T> delegate passed in is expected to be safe to execute in parallel. This means that if it refers to shared state, it needs to use proper synchronization to eliminate concurrency bugs. If it isn't safe, we can expect the correctness and reliability of our program to be quite poor. The hypothetical usage that follows shows one safe pattern.
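In this sketch of mine (SumOfSquares and its inputs are made up), the per-element work is purely local, and the only shared state, the running total, is updated with Interlocked.Add rather than an unsynchronized += (which would race):
static long SumOfSquares(int[] inputs)
{
    long total = 0;
    ForAll(inputs, delegate(int value)
    {
        long square = (long)value * value;   // purely local: no locking needed
        Interlocked.Add(ref total, square);  // shared total: synchronized update
    }, 0.0f);
    return total;
}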
Another data-parallelism technique is pipelining, where multiple operations execute in parallel, streaming data between each other using a fast, shared buffer. This is akin to an assembly line, where each step in the process is given its chance to interact with some data and then pass it on to the next step in line. This technique requires clever synchronization code to minimize time spent at the obvious bottleneck: the point where adjacent stages in the pipeline communicate through a shared buffer.
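As a rough illustration only (the type name, fixed capacity, and absence of shutdown handling are all simplifications of mine), the shared buffer between two adjacent stages might be sketched with Monitor.Wait and Monitor.Pulse, assuming using System.Collections.Generic and System.Threading:
class PipelineBuffer<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly int capacity;

    public PipelineBuffer(int capacity) { this.capacity = capacity; }

    // Called by the producing stage.
    public void Add(T item)
    {
        lock (queue)
        {
            while (queue.Count == capacity)
                Monitor.Wait(queue);       // buffer full: wait for the consumer
            queue.Enqueue(item);
            Monitor.PulseAll(queue);       // wake any waiting consumer
        }
    }

    // Called by the consuming stage.
    public T Take()
    {
        lock (queue)
        {
            while (queue.Count == 0)
                Monitor.Wait(queue);       // buffer empty: wait for the producer
            T item = queue.Dequeue();
            Monitor.PulseAll(queue);       // wake any waiting producer
            return item;
        }
    }
}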
How Many Tasks?
Choosing the number of tasks to create is also a tricky equation. If throughput is the sole priority, you can use some theoretical target such as the following, where BP is the percentage of time tasks will block:
NumThreads = NumCPUs / (1 – BP)
That is, you'd like the number of threads to be the number of CPUs divided by the fraction of time tasks spend doing real work. This was illustrated in the ForAll example earlier. Unfortunately, while this is a good theoretical starting point, it won't get you a precise answer. It doesn't account for HT, for example, where high memory latencies permit computations to occur in parallel, but which otherwise shouldn't count as a full processor. And it assumes, quite naively, that you could actually predict the value of BP, which I assure you is difficult, especially for a component trying to schedule heterogeneous work, much like the CLR's thread pool. When in doubt, it's better to rely on the thread pool for scheduling tasks onto OS threads, and to tend towards over-expressing concurrency.
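To make the formula concrete, here is the arithmetic behind the 20 partitions used in the earlier 16-CPU array example:
// 16 hardware threads, tasks blocked roughly 20 percent of the time:
int numCpus = 16;
float blockingPercentage = 0.2f;
int numThreads = (int)(numCpus / (1 - blockingPercentage)); // 16 / 0.8 = 20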
There is a natural speedup curve for any algorithm. On this curve, there are two particularly interesting points to consider. First, what is the minimum number of tasks at which parallelizing the computation begins to pay off? For small computations, it may be that using a small number of tasks incurs too much overhead (thread creation and cache misses), but that using a larger number allows the execution to catch up to the sequential version and surpass it. Second, given an infinite number of hardware threads, what is the maximum number of tasks you can assign to a problem before performance begins to degrade rather than continue to speed up? All problems reach this point of diminishing returns. As you continue to subdivide a problem, eventually you're going to reach the granularity of individual instructions.
A linear speedup would mean that the time it takes to execute a problem with p processors is 1/p the time it takes to execute it with one processor. Amdahl's Law tends to limit the ability to achieve such speedups. It says, quite simply, that the maximum speedup is limited by the amount of sequential execution remaining after you introduce parallelism. More formally, the law states that if S is the fraction of the problem that must remain sequential (it cannot be parallelized) and p is the number of CPUs being used, then the approximate speedup you can expect is:
1/(S + ((1 – S)/p))
As the number of processors grows, this expression approaches 1/S. Therefore, if you can only parallelize (say) 85 percent of the problem, you will only be able to achieve a speedup of 1/0.15, or approximately 6.7. Any overhead associated with synchronization and introducing concurrency tends to contribute to S. In reality, however, there is good news: spreading work across multiple processors also carries benefits that are hard to quantify and measure, such as allowing concurrent threads to keep their (separate) caches warm.
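To get a feel for how quickly the curve flattens, here's a small helper of my own that simply evaluates the formula for S = 0.15:
static double AmdahlSpeedup(double s, int p)
{
    return 1.0 / (s + ((1.0 - s) / p));
}

// With S = 0.15:
//   AmdahlSpeedup(0.15, 2)    returns roughly 1.74
//   AmdahlSpeedup(0.15, 8)    returns roughly 3.90
//   AmdahlSpeedup(0.15, 32)   returns roughly 5.66
//   AmdahlSpeedup(0.15, 1024) returns roughly 6.63, creeping toward 1/S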
Any algorithm managing real resources also has to take into account machine-wide utilization. Software that makes entirely local decisions to maximize parallelism (especially in a server environment such as ASP.NET) can, and will, lead to chaos as well as an increase in contention for resources, including CPUs. A ForAll-style loop, for example, might query processor utilization before dynamically deciding the optimal number of tasks. Instead of the algorithm used in Figure 3, which depends on the System.Environment.ProcessorCount property, you might consider using the GetFreeProcessors function shown in Figure 4.
 Figure 4 GetFreeProcessors
private static List<PerformanceCounter> utilizationCounters;

static void InitCounters()
{
    // Initialize the list to a counter-per-processor:
    utilizationCounters = new List<PerformanceCounter>();
    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        utilizationCounters.Add(new PerformanceCounter(
            "Processor", "% Processor Time", i.ToString()));
    }
}

private static int GetFreeProcessors()
{
    int freeCount = 0;
    foreach (PerformanceCounter pc in utilizationCounters)
    {
        // % Processor Time is reported on a 0-100 scale, so "free"
        // here means less than 80 percent busy.
        if (pc.NextValue() < 80.0f)
            freeCount++;
    }
    return freeCount;
}

// ForAll change...
int partitions = Math.Min(data.Count, (int)Math.Max(1.0f,
    (float)GetFreeProcessors() / (1 - blocking)));
That algorithm isn't perfect. It's only a statistical snapshot of the machine state at the time it runs, and says nothing about what happens after it returns its result. It could be overly optimistic or pessimistic. And of course it doesn't account for the fact that one of the processors being queried is the one executing the GetFreeProcessors function itself; accounting for that would be a useful improvement. Another interesting statistical metric to look at is the System\Processor Queue Length performance counter, which tells you how many threads are in the scheduling queue waiting for a processor to become free. A large number here indicates that any new work you introduce will simply have to wait for the queue to drain (assuming all threads are of equivalent priority).
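Reading that counter from managed code is straightforward (a sketch assuming using System.Diagnostics; the threshold of twice the processor count is an arbitrary illustration of mine, not a recommendation):
// The "System" category's "Processor Queue Length" counter has a
// single machine-wide instance, so no instance name is needed.
PerformanceCounter queueLength =
    new PerformanceCounter("System", "Processor Queue Length");
float waitingThreads = queueLength.NextValue();

// If plenty of runnable threads are already waiting, newly queued
// work will just sit behind them in the scheduler's queue.
if (waitingThreads > 2 * Environment.ProcessorCount)
{
    // Throttle back: create fewer tasks for now.
}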
There are some interesting reasons to create too much concurrency rather than too little. If you're dealing with heterogeneous tasks, the model of letting each task execute on a thread until it completes may lead to fairness problems. Task A, which takes substantially longer to run than task B, could lead to starvation of B if additional resources are not freed up. This is especially bad if A decides to block and your algorithm didn't take that into account.
Another reason to intentionally over-parallelize is asynchronous I/O. Windows provides I/O completion ports for superior scalability, in which case outstanding I/O requests don't even need to occupy an OS thread. The I/O begins asynchronously and, once it is complete, Windows posts a completion packet to the underlying port. Typically an efficiently sized pool of threads is bound to the port (taken care of on the CLR by the thread pool), waiting to process completion packets as they become available. Assuming a sparse completion rate, issuing a large number of I/O requests as fast as possible in parallel can lead to better scalability than if each task waited behind the other for its turn to initiate the asynchronous I/O. This applies to file, network, and memory-mapped I/O, although you should always be cognizant of the fact that there are a fixed number of shared resources on the machine; competing for them too heavily will only degrade scaling, not enhance it.
Shared State
Any time you introduce concurrency, you need to worry about protecting shared state. This is crucial. I recommend you read Vance Morrison's article in the August 2005 issue of MSDN® Magazine (msdn.microsoft.com/msdnmag/issues/05/08/Concurrency) about why locking is important. Correctness should always take precedence over performance, and if you're using concurrency without considering locking, your code probably isn't correct. I'm not going to reiterate what Vance has already said quite nicely, but rather I am going to focus on the performance of such techniques.
The most common techniques for synchronization are locking and low-lock operations. Locking uses primitives like the Win32® Mutex or CRITICAL_SECTION, or the CLR Monitor, ReaderWriterLock, or associated language keywords (for example, lock in C# and SyncLock in Visual Basic®) to achieve some degree of mutual exclusion. To achieve this exclusion, a call is made to an API; some internal algorithm ensures that no two pieces of code using the same lock can enter the protected region of code. So long as everybody abides by the protocol, code remains safe.
Low-lock operations can be built using interlocked primitives, which are implemented via hardware support for atomic load-compare-store instructions. They ensure that a single update to memory is atomic, and they can be used to build highly scalable code that uses optimistic concurrency. Such code is more difficult to write, but it tends not to block. (Locks, in case you are wondering, are themselves written using such primitives.)
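For a flavor of what optimistic, low-lock code looks like, here is a small sketch (the high-water-mark example is mine) built on Interlocked.CompareExchange: read the current value, compute a replacement, and publish it only if nothing changed in the meantime, retrying otherwise:
static int highWaterMark;

static void UpdateHighWaterMark(int candidate)
{
    while (true)
    {
        int observed = highWaterMark;
        if (candidate <= observed)
            return;                       // nothing to publish
        if (Interlocked.CompareExchange(
                ref highWaterMark, candidate, observed) == observed)
            return;                       // our update won the race
        // Another thread changed the value first; re-read and retry.
    }
}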
But making these calls comes with a cost. Figure 5 shows a micro-benchmark of the cost, in CPU cycles, of acquiring various types of locks in the uncontended case.
Figure 5  Comparing the Cost of Various Locks
Such measurements are useful for understanding the performance implications of locking, especially when dynamic decisions about parallel execution are made, in which case a larger amount of code "needs to be ready" for concurrency than will actually run in parallel. Making certain that you synchronize at the right granularity helps ensure costs like these don't dominate the execution of your code. There are also costs I haven't mentioned, such as the interaction between interlocked operations and the memory hierarchy; regrettably, space doesn't permit covering that here. The more interesting part, though, is the effect on scalability. You often need to make tradeoffs between scalability and sequential straight-line performance, and these tradeoffs should be informed by measurements.
Nothing guarantees that a thread remains runnable while a lock is held, and therefore if its quantum expires, a subsequent thread may run and attempt to acquire that same lock. Moreover, a higher priority thread that becomes runnable can preempt a thread running under a lock. This can cause a phenomenon known as priority inversion, and can also lead to lock convoys if the arrival rate at a contended lock is extraordinarily high. Most locks react to contended acquisition attempts with some form of lightweight spinning on multi-CPU systems, in the hope that the thread holding the lock will soon release it. If that fails, because either the owner is holding it for longer than expected or perhaps because it was swapped out as a result of a context switch, the waiting thread will block. For a highly concurrent system, the more blocking there is, the more threads are needed to keep CPUs busy, and the lower the probability that your system is going to scale nicely.
Thus an important question to keep in the back of your mind is: how can I do the least amount of work while holding a lock, minimizing the amount of locking required? Reader/writer locks can help with this, allowing multiple threads to read data simultaneously while still ensuring mutual exclusion for writes. For most systems, the ratio of readers to writers is very high, so the scalability win can be huge. Jeffrey Richter's Concurrent Affairs column from the June 2006 issue of MSDN Magazine is a great starting point to learn more (see msdn.microsoft.com/msdnmag/issues/06/06/ConcurrentAffairs).
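A minimal sketch of the read-mostly pattern with the CLR's ReaderWriterLock follows (the cache dictionary and its contents are hypothetical; assumes using System.Collections.Generic and System.Threading):
static ReaderWriterLock rwLock = new ReaderWriterLock();
static Dictionary<string, string> cache = new Dictionary<string, string>();

static string Lookup(string key)
{
    // Many readers may hold the lock at once.
    rwLock.AcquireReaderLock(Timeout.Infinite);
    try
    {
        string value;
        cache.TryGetValue(key, out value);
        return value;
    }
    finally { rwLock.ReleaseReaderLock(); }
}

static void Update(string key, string value)
{
    // Writers get exclusive access.
    rwLock.AcquireWriterLock(Timeout.Infinite);
    try { cache[key] = value; }
    finally { rwLock.ReleaseWriterLock(); }
}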
With that said, if you can avoid sharing state in the first place, you needn't synchronize access at all. A common technique to increase the scalability of algorithms that manipulate hot data structures (data that most threads must access) is to avoid the use of locks altogether. This can take three interesting forms, in increasing order of difficulty to implement: immutability, isolation, and lock freedom.
Immutability means that an instance, once created, will not change, or at least won't change during a fixed and well-known period of time. A CLR string, for example, is immutable and therefore doesn't require a lock around accesses to its individual characters. If state isn't in motion, you don't need to lock. This becomes difficult to achieve when there are multiple locations containing pointers to state that are supposed to be observed atomically.
Isolation eliminates concurrent access to data by maintaining separate copies. Many thread-safe C implementations of the malloc and free operations, for example, maintain a per-thread pool of available memory so that allocating threads don't contend for a single pool (which is likely to be a hot spot in any C program). Similarly, the CLR's server Garbage Collector (GC) uses a per-thread allocation context and per-CPU segments of memory to increase the throughput of memory allocations. This approach typically requires periodic rendezvousing with a central copy of the data structures, and can carry costs associated with copying and with ensuring that interesting bits of data don't become stale.
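One simple way to get isolation in managed code is a [ThreadStatic] field, which gives each thread its own copy. The buffer pool below is a hypothetical sketch of mine (assumes using System.Collections.Generic); the common path never takes a lock because each thread only ever touches its own free list:
static class BufferPool
{
    [ThreadStatic]
    private static Stack<byte[]> freeList;   // one free list per thread

    public static byte[] Acquire()
    {
        if (freeList == null)
            freeList = new Stack<byte[]>();
        if (freeList.Count > 0)
            return freeList.Pop();           // common path: no locking
        return new byte[4096];               // miss: allocate a fresh buffer
    }

    public static void Release(byte[] buffer)
    {
        if (freeList == null)
            freeList = new Stack<byte[]>();
        freeList.Push(buffer);               // goes back to this thread's list
    }
}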
Lock freedom is such a tricky technique that I will only mention it in passing. If you really understand the memory model of your target machine and are willing to write and maintain a whole lot more code, you can create clever data structures that scale nicely when accessed in parallel. More often than not, the resulting code is so difficult to test for correctness and to maintain that it isn't worth the effort. These techniques are worth investigating only for those areas of your program where you've measured a scaling or performance problem associated with the use of locks.
Tools for Profiling Parallelism
Let's see how you might measure and improve the scalability of your code. Throughout this column, I've been rather fuzzy on techniques, approaches, and costs. Unfortunately, there isn't one magic formula that works for all parallel problems, and similarly there isn't an easy answer to the question of how to profile problems or identify better approaches to attaining a parallel speedup. It's entirely possible you will go through all of the work I outlined here (and then some; I haven't even discussed debugging) and yet end up no better off than if you had stuck with a sequential algorithm. There are so-called embarrassingly parallel problems for which cookbook-like algorithms have been written and made available online and in textbooks. Unfortunately, many real-world programs aren't so straightforward.
Here are some tips for profiling your parallel algorithms. All of these make use of the new Visual Studio® 2005 profiler. It is built into the ordinary Visual Studio interface (under the Tools | Performance Tools | New Performance Session menu item), and also has a command-line version, VSPerfCmd.exe, located in the \Team Tools\Performance Tools\ subdirectory of Visual Studio. (See msdn2.microsoft.com/ms182403.aspx for usage details about this tool.) The profiler creates VSP files that can be piped through the VSPerfReport.exe command to create a CSV or XML file for further analysis. Here are a few items to look for.
Ensure your CPUs are busy. If you have low processor utilization, it's likely that one of two things is happening: either you didn't create enough work to keep your processors busy, or threads are getting backed up waiting for each other, most likely due to excessive synchronization at hot points in the code. Task Manager is typically sufficient for this, although the Processor\% Processor Time performance counter (via PerfMon.exe) can also be used.
Make sure your program isn't faulting heavily. Especially for data-intensive applications, you need to ensure you don't overflow physical memory on a regular basis. In such a case, a system full of threads can continually thrash the disk as it constantly pages data in and out. (Remember how expensive disk access is from the chart earlier?) Task Manager can give you this data (you'll need to select it as a column), as can PerfMon.exe. VSPerfCmd can also report this data via ETW events with this command:
VSPerfCmd.exe /events:on,Pagefault /start:SAMPLE /output:<file> /launch:<app>
Then use the following command once the program completes:
VSPerfCmd.exe /shutdown
You may also want to play around with the sampling interval.
Identify where your program is spending most of its CPU time. This is especially important if this CPU time occurs while locks are held. It could also be the case that the additional code required to create threads, perform synchronization, and everything associated with those two things is dominating the execution time.
Examine the System\Context Switches/sec and System\Processor Queue Length performance counters. These can help determine whether you have too many threads, causing time to be wasted in context switches and thread migration. If so, try tweaking the algorithm that determines how many tasks to use.
Look for a memory hierarchy and cache problem. If none of the previous suggestions work and it seems that you should be seeing a bigger speedup, you might have a memory hierarchy and cache problem. A lot of time spent in cache misses and invalidations can dramatically limit the ability to speed up your program. Partitioning data to be more cache-line friendly and using some of the approaches mentioned above, like isolation, can help to resolve this problem. Each CPU offers a set of performance counters that can be queried in Visual Studio's profiler, covering things like instructions retired and cache misses. A low instructions-retired count indicates more time spent in high-latency operations, such as cache misses, and the cache-specific counters can be used to determine where the misses are occurring and at what frequency.
While the exact counters are processor-specific, the Visual Studio interface gives you a nice menu option for using them (see Figure 6), and you can also query these counters through the command:
VSPerfCmd.exe /QueryCounters
Figure 6  Profiler Properties
Conclusion
Recommended Reading
"Writing Faster Managed Code: Know What Things Cost," Jan Gray, MSDN, June 2003
"Writing Scalable Applications for Windows NT," John Vert, MSDN, June 1995
"Concurrency: What Every Dev Must Know About Multithreaded Apps," Vance Morrison, MSDN Magazine, August 2005
Scalability via parallelism has historically been limited to server-side and high-end workstation environments. But as new hardware trends towards thread-level parallelism, namely multi-core architectures, mainstream client-side software will ultimately have to cope with, and make efficient use of, the resources available. This brings along with it a unique set of challenges. Clearly parallelism is not a replacement for highly efficient sequential code, but rather a technique for making already optimized sequential algorithms run even faster.
This was a very fast-paced and high-level overview. You'll probably walk away saying it's way too hard, and you wouldn't be wrong. But these techniques will help you begin architecting code that will scale nicely on newer multi-core processors as they become more ubiquitous over time. As infrastructure like the CLR thread pool and tools like Visual Studio evolve to better support this style of programming, you can expect many of the difficulties to become more manageable.
Send your questions and comments to  clrinout@microsoft.com.
Joe Duffy works on the CLR team at Microsoft. His focus is on developing concurrency abstractions, programming models, and infrastructure for the platform. He just released a book, Professional .NET Framework 2.0 (Wrox/Wiley, 2006), and is in the process of writing another, Concurrent Programming on Windows (Addison-Wesley). He writes frequently on his blog: www.bluebytesoftware.com/blog.