Scalability Best Practices: Lessons from eBay



Posted by Randy Shoup on May 27, 2008

At eBay, one of the primary architectural forces we contend with every day is scalability. It colors and drives every architectural and design decision we make. With hundreds of millions of users worldwide, over two billion page views a day, and petabytes of data in our systems, this is not a choice - it is a necessity.

In a scalable architecture, resource usage should increase linearly (or better) with load, where load may be measured in user traffic, data volume, etc. Where performance is about the resource usage associated with a single unit of work, scalability is about how resource usage changes as units of work grow in number or size. Said another way, scalability is the shape of the price-performance curve, as opposed to its value at one point in that curve.

There are many facets to scalability - transactional, operational, development effort. In this article, I will outline several of the key best practices we have learned over time to scale the transactional throughput of a web-based system. Most of these best practices will be familiar to you. Some may not. All come from the collective experience of the people who develop and operate the eBay site.

Best Practice #1: Partition by Function

Whether you call it SOA, functional decomposition, or simply good engineering, related pieces of functionality belong together, while unrelated pieces of functionality belong apart. Further, the more decoupled that unrelated functionality can be, the more flexibility you will have to scale them independently of one another.

At the code level, we all do this all the time. JAR files, packages, bundles, etc., are all mechanisms we use to isolate and abstract one set of functionality from another.

At the application tier, eBay segments different functions into separate application pools. Selling functionality is served by one set of application servers, bidding by another, search by yet another. In total, we organize our roughly 16,000 application servers into 220 different pools. This allows us to scale each pool independently of one another, according to the demands and resource consumption of its function. It further allows us to isolate and rationalize resource dependencies - the selling pool only needs to talk to a relatively small subset of backend resources, for example.
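To make the idea concrete, here is a minimal sketch of what function-level partitioning looks like at the routing layer. The pool names, server names, and the hash-based pick are all invented for illustration, not eBay's actual configuration:

```python
import zlib

# Each functional area gets its own pool of servers, sized independently.
POOLS = {
    "selling": ["sell-01", "sell-02", "sell-03"],
    "bidding": ["bid-01", "bid-02"],
    "search":  ["search-01", "search-02", "search-03", "search-04"],
}

def route(function, request_id):
    """Pick a server from the pool that owns this function.

    crc32 gives a stable hash, so the same request id maps to the
    same server across runs.
    """
    pool = POOLS[function]
    return pool[zlib.crc32(request_id.encode()) % len(pool)]

# Scaling the "search" function is just adding servers to its pool;
# the selling and bidding pools are untouched.
POOLS["search"].append("search-05")
```

The point of the sketch is the independence: growing one pool is a local change, invisible to every other functional area.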

At the database tier, we follow much the same approach. There is no single monolithic database at eBay. Instead there is a set of database hosts for user data, a set for item data, a set for purchase data, etc. - 1000 logical databases in all, on 400 physical hosts. Again, this approach allows us to scale the database infrastructure for each type of data independently of the others.

Best Practice #2: Split Horizontally

While functional partitioning gets us part of the way, by itself it is not sufficient for a fully scalable architecture. As decoupled as one function may be from another, the demands of a single functional area can and will outgrow any single system over time. Or, as we like to remind ourselves, "if you can't split it, you can't scale it." Within a particular functional area, then, we need to be able to break the workload down into manageable units, where each individual unit retains good price-performance. Here is where the horizontal split comes in.

At the application tier, where eBay's interactions are by design stateless, splitting horizontally is trivial. Use a standard load-balancer to route incoming traffic. Because all application servers are created equal and none retains any transactional state, any of them will do. If we need more processing power, we simply add more application servers.

The more challenging problem arises at the database tier, since data is stateful by definition. Here we split (or "shard") the data horizontally along its primary access path. User data, for example, is currently divided over 20 hosts, with each host containing 1/20 of the users. As our numbers of users grow, and as the data we store for each user grows, we add more hosts, and subdivide the users further. Again, we use the same approach for items, for purchases, for accounts, etc. Different use cases use different schemes for partitioning the data: some are based on a simple modulo of a key (item ids ending in 1 go to one host, those ending in 2 go to the next, etc.), some on a range of ids (0-1M, 1-2M, etc.), some on a lookup table, some on a combination of these strategies. Regardless of the details of the partitioning scheme, though, the general idea is that an infrastructure which supports partitioning and repartitioning of data will be far more scalable than one which does not.
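The three partitioning schemes mentioned above can be sketched in a few lines. The host counts, range size, and lookup-table entries here are invented for illustration:

```python
NUM_HOSTS = 20

def shard_by_modulo(item_id):
    """Simple modulo of a numeric key: ids ending the same way land together."""
    return item_id % NUM_HOSTS

def shard_by_range(user_id, range_size=1_000_000):
    """Contiguous id ranges: 0-1M on host 0, 1M-2M on host 1, and so on."""
    return user_id // range_size

# Lookup-table scheme: an explicit mapping, consulted per key or segment.
LOOKUP = {"power_sellers": 7, "default": 0}

def shard_by_lookup(segment):
    return LOOKUP.get(segment, LOOKUP["default"])
```

Modulo is the simplest but forces a full reshuffle when the host count changes; ranges and lookup tables make repartitioning easier, since individual ranges or entries can be remapped one at a time.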

Best Practice #3: Avoid Distributed Transactions

At this point, you may well be wondering how the practices of partitioning data functionally and horizontally jibe with transactional guarantees. After all, almost any interesting operation updates more than one type of entity - users and items come to mind immediately. The orthodox answer is well-known and well-understood - create a distributed transaction across the various resources, using two-phase commit to guarantee that all updates across all resources either occur or do not. Unfortunately, this pessimistic approach comes with substantial costs. Scaling, performance, and latency are adversely affected by the costs of coordination, which worsens geometrically as you increase the number of dependent resources and incoming clients. Availability is similarly limited by the requirement that all dependent resources are available. The pragmatic answer is to relax your transactional guarantees across unrelated systems.

It turns out that you can't have everything. In particular, guaranteeing immediate consistency across multiple systems or partitions is typically neither required nor possible. The CAP theorem, postulated almost 10 years ago by Inktomi's Eric Brewer, states that of three highly desirable properties of distributed systems - consistency (C), availability (A), and partition-tolerance (P) - you can only choose two at any one time. For a high-traffic web site, we have to choose partition-tolerance, since it is fundamental to scaling. For a 24x7 web site, we typically choose availability. So immediate consistency has to give way.

At eBay, we allow absolutely no client-side or distributed transactions of any kind - no two-phase commit. In certain well-defined situations, we will combine multiple statements on a single database into a single transactional operation. For the most part, however, individual statements are auto-committed. While this intentional relaxation of orthodox ACID properties does not guarantee immediate consistency everywhere, the reality is that most systems are available the vast majority of the time. Of course, we do employ various techniques to help the system reach eventual consistency: careful ordering of database operations, asynchronous recovery events, and reconciliation or settlement batches. We choose the technique according to the consistency demands of the particular use case.
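Two of these techniques - careful ordering and a reconciliation batch - can be sketched together. This is an illustrative toy, with plain dicts standing in for two separate stores that cannot share a transaction:

```python
items = {}         # authoritative item store: written first
search_index = {}  # derived store: updated best-effort, repaired later

def create_item(item_id, title):
    # Careful ordering: write the source of truth before the derived copy,
    # so a failure leaves a missing index entry, never a dangling one.
    items[item_id] = title
    try:
        search_index[item_id] = title
    except Exception:
        pass  # tolerated: the reconciliation batch will fix it

def reconcile():
    """Settlement batch: copy over anything the derived store is missing."""
    for item_id, title in items.items():
        if item_id not in search_index:
            search_index[item_id] = title
```

Between a failed write and the next reconciliation run, the two stores disagree - that window is exactly the "eventual" in eventual consistency.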

The key takeaway here for architects and system designers is that consistency should not be viewed as an all or nothing proposition. Most real-world use cases simply do not require immediate consistency. Just as availability is not all or nothing, and we regularly trade it off against cost and other forces, similarly our job becomes tailoring the appropriate level of consistency guarantees to the requirements of a particular operation.

Best Practice #4: Decouple Functions Asynchronously

The next key element to scaling is the aggressive use of asynchrony. If component A calls component B synchronously, A and B are tightly coupled, and that coupled system has a single scalability characteristic -- to scale A, you must also scale B. Equally problematic is its effect on availability. Going back to Logic 101, if A implies B, then not-B implies not-A. In other words, if B is down then A is down. By contrast, if A and B integrate asynchronously, whether through a queue, multicast messaging, a batch process, or some other means, each can be scaled independently of the other. Moreover, A and B now have independent availability characteristics - A can continue to move forward even if B is down or distressed.
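The contrast can be sketched with an in-process queue standing in for a durable message broker (component names are invented):

```python
import queue

events = queue.Queue()  # stand-in for a durable message queue between A and B

def component_a(order_id):
    """A records its own work and enqueues a message; it never calls B."""
    events.put(("order_placed", order_id))
    return "accepted"  # A succeeds even if B is down right now

def component_b_drain():
    """B consumes at its own pace, possibly much later."""
    processed = []
    while not events.empty():
        processed.append(events.get())
    return processed
```

In the synchronous version, `component_a` would call B directly and fail whenever B failed; here the queue absorbs B's outages, and A's availability depends only on the queue.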

This principle can and should be applied up and down an infrastructure. Techniques like SEDA (Staged Event-Driven Architecture) can be used for asynchrony inside an individual component while retaining an easy-to-understand programming model. Between components, the principle is the same -- avoid synchronous coupling as much as possible. More often than not, the two components have no business talking directly to one another in any event. At every level, decomposing the processing into stages or phases, and connecting them up asynchronously, is critical to scaling.

Best Practice #5: Move Processing To Asynchronous Flows

Now that you have decoupled asynchronously, move as much processing as possible to the asynchronous side. In a system where replying rapidly to a request is critical, this can substantially reduce the latency experienced by the requestor. In a web site or trading system, it is worth it to trade off data or execution latency (how quickly we get everything done) for user latency (how quickly the user gets a response). Activity tracking, billing, settlement, and reporting are obvious examples of processing that belongs in the background. But often significant steps in processing of the primary use case can themselves be broken out to run asynchronously. Anything that can wait should wait.

Equally as important, but less often appreciated, is the fact that asynchrony can substantially reduce infrastructure cost. Performing operations synchronously forces you to scale your infrastructure for the peak load - it needs to handle the worst second of the worst day at that exact second. Moving expensive processing to asynchronous flows, though, allows you to scale your infrastructure for the average load instead of the peak. Instead of needing to process all requests immediately, the queue spreads the processing over time, and thereby dampens the peaks. The more spiky or variable the load on your system, the greater this advantage becomes.
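A toy simulation makes the peak-versus-average point concrete. The arrival pattern is invented: bursts of 10 requests with quiet ticks in between, so the peak rate is 10 per tick but the average is about 3.3:

```python
arrivals = [10, 0, 0, 10, 0, 0]  # requests per tick: peak 10, average ~3.3

def run(capacity_per_tick):
    """Return (final backlog, max backlog) for a worker of given capacity."""
    backlog, max_backlog = 0, 0
    for a in arrivals:
        backlog += a                              # new work joins the queue
        backlog = max(0, backlog - capacity_per_tick)  # worker drains some
        max_backlog = max(max_backlog, backlog)
    return backlog, max_backlog

# Synchronous sizing must handle the worst tick: capacity 10, no backlog ever.
# Asynchronous sizing only needs to beat the average: capacity 4 drains fully,
# at the cost of a temporary backlog of 6 during the bursts.
```

That temporary backlog is the trade: latency for the queued work rises during bursts, but the worker fleet is sized at less than half the peak.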

Best Practice #6: Virtualize At All Levels

Virtualization and abstraction are everywhere, following the old computer science aphorism that the solution to every problem is another level of indirection. The operating system abstracts the hardware. The virtual machine in many modern languages abstracts the operating system. Object-relational mapping layers abstract the database. Load-balancers and virtual IPs abstract network endpoints. As we scale our infrastructure through partitioning by function and data, an additional level of virtualization of those partitions becomes critical.

At eBay, for example, we virtualize the database. Applications interact with a logical representation of a database, which is then mapped onto a particular physical machine and instance through configuration. Applications are similarly abstracted from the split routing logic, which assigns a particular record (say, that of user XYZ) to a particular partition. Both of these abstractions are implemented in our home-grown O/R layer. This allows the operations team to rebalance logical hosts between physical hosts, by separating them, consolidating them, or moving them -- all without touching application code.
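The core of that indirection is just a configuration table between logical names and physical machines. A minimal sketch, with all host names invented:

```python
# Applications name a *logical* host; configuration maps it to a physical one.
LOGICAL_TO_PHYSICAL = {
    "user_shard_03": "db-host-117.example.internal",
    "item_shard_12": "db-host-042.example.internal",
}

def connect(logical_name):
    """Resolve a logical database name and connect to its physical host."""
    physical = LOGICAL_TO_PHYSICAL[logical_name]
    return f"connection to {physical}"  # stand-in for a real DB connection

# Rebalancing is a configuration change, invisible to application code:
LOGICAL_TO_PHYSICAL["user_shard_03"] = "db-host-205.example.internal"
```

Application code only ever says `connect("user_shard_03")`; which machine that resolves to is purely an operational decision.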

We similarly virtualize the search engine. To retrieve search results, an aggregator component parallelizes queries over multiple partitions, and makes a highly partitioned search grid appear to clients as one logical index.
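The aggregator follows a scatter-gather pattern: fan the query out to every partition in parallel, then merge the partial results. A toy sketch, with dicts standing in for index partitions:

```python
from concurrent.futures import ThreadPoolExecutor

# Each partition holds a slice of the index: query -> matching item ids.
PARTITIONS = [
    {"red shoe": [1, 4]},
    {"red shoe": [7]},
    {"blue hat": [9]},
]

def search(query):
    """Scatter the query to all partitions, gather and merge the results."""
    def query_partition(partition):
        return partition.get(query, [])

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_partition, PARTITIONS))
    return sorted(id_ for ids in partials for id_ in ids)
```

To the caller, `search` looks like one logical index; how many partitions sit behind it, and where they live, is hidden by the aggregator.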

The motivation here is not only programmer convenience, but also operational flexibility. Hardware and software systems fail, and requests need to be re-routed. Components, machines, and partitions are added, moved, and removed. With judicious use of virtualization, higher levels of your infrastructure are blissfully unaware of these changes, and you are therefore free to make them. Virtualization makes scaling the infrastructure possible because it makes scaling manageable.

Best Practice #7: Cache Appropriately

The last component of scaling is the judicious use of caching. The specific recommendations here are less universal, because they tend to be highly dependent on the details of the use case. At the end of the day, the goal of an efficient caching system is to maximize your cache hit ratio within your storage constraints, your requirements for availability, and your tolerance for staleness. It turns out that this balance can be surprisingly difficult to strike. Once struck, our experience has shown that it is also quite likely to change over time.

The most obvious opportunities for caching come with slow-changing, read-mostly data - metadata, configuration, and static data, for example. At eBay, we cache this type of data aggressively, and use a combination of pull and push approaches to keep the system reasonably in sync in the face of updates. Reducing repeated requests for the same data can and does make a substantial impact. More challenging is rapidly-changing, read-write data. For the most part, we intentionally sidestep these challenges at eBay. We have traditionally not done any caching of transient session data between requests. We similarly do not cache shared business objects, like item or user data, in the application layer. We are explicitly trading off the potential benefits of caching this data against availability and correctness. It should be noted that other sites do take different approaches, make different tradeoffs, and are also successful.
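For slow-changing data, the simplest pull-based approach is a time-to-live (TTL) cache: serve from the cache while an entry is fresh, and refetch from the primary store when it expires. A minimal sketch (the class and its loader callback are invented for illustration):

```python
import time

class TTLCache:
    """Pull-based cache for read-mostly data; staleness is bounded by the TTL."""

    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader   # callback that fetches from the primary store
        self._data = {}        # key -> (value, expiry timestamp)

    def get(self, key):
        hit = self._data.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]      # fresh entry: serve from cache
        value = self.loader(key)  # miss or stale: go to the source
        self._data[key] = (value, time.monotonic() + self.ttl)
        return value
```

The TTL is where the staleness tradeoff lives: a long TTL maximizes the hit ratio but lengthens the window in which updates are invisible, which is why this pattern fits metadata and configuration far better than rapidly-changing business objects.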

Not surprisingly, it is quite possible to have too much of a good thing. The more memory you allocate for caching, the less you have available to service individual requests. In an application layer which is often memory-constrained, this is a very real tradeoff. More importantly, though, once you have come to rely on your cache, and have taken the extremely tempting steps of downsizing the primary systems to handle just the cache misses, your infrastructure literally may not be able to survive without it. Once your primary systems can no longer directly handle your load, your site's availability now depends on 100% uptime of the cache - a potentially dangerous situation. Even something as routine as rebalancing, moving, or cold-starting the cache becomes problematic.

Done properly, a good caching system can bend your scaling curve below linear - subsequent requests retrieve data cheaply from cache rather than the relatively more expensive primary store. On the other hand, caching done poorly introduces substantial additional overhead and availability challenges. I have yet to see a system where there are not significant opportunities for caching. The key point, though, is to make sure your caching strategy is appropriate for your situation.

Summary

Scalability is sometimes called a "non-functional requirement," implying that it is unrelated to functionality, and strongly implying that it is less important. Nothing could be further from the truth. Rather, I would say, scalability is a prerequisite to functionality - a "priority-0" requirement, if ever there was one.

I hope that you find the descriptions of these best practices useful, and that they help you to think in a new way about your own systems, whatever their scale.

References

  • eBay's Architectural Principles (video)
  • Werner Vogels on scalability
  • Dan Pritchett on You Scaled Your What?
  • The Coming of the Shard
  • Trading Consistency for Availability in Distributed Architectures
  • Eric Brewer on the CAP Theorem
  • SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

About the Author

Randy Shoup is a Distinguished Architect at eBay. Since 2004, he has been the primary architect for eBay's search infrastructure. Prior to eBay, he was the Chief Architect at Tumbleweed Communications, and has also held a variety of software development and architecture roles at Oracle and Informatica.

He presents regularly at industry conferences on scalability and architecture patterns.