[Bernstein09] 9.4. Single-Master Primary-Copy Replication

Normal Operation

The most straightforward, and often pragmatic, approach to replication is to designate one replica as the primary copy and to allow update transactions to read and write only that replica. This is the primary-backup technique illustrated in Figure 9.1. Updates on the primary are distributed to other replicas, called secondaries, in the order in which they executed at the primary and are applied to secondaries in that order (see Figure 9.7). Thus, all replicas process the same stream of updates in the same order. In between any two update transactions, a replica can process a local query.

Figure 9.7. Propagating Updates from Primary to Secondaries. Transactions update data only at the primary replica. The primary propagates updates to the secondary replicas. The secondaries can process local queries.


One way to propagate updates is by synchronous replication. For example, in a relational database system, one could define an SQL trigger on the primary table that remotely updates secondary copies of the table within the context of the user’s update transaction. This implies that updates are propagated right away, which may delay the completion of the transaction. It also means that administrators cannot control when updates are applied to replicas. For example, in some decision support systems, it is desirable to apply updates at fixed times, so the database remains unchanged when certain analysis work is in progress.

Currently, the more popular approach is asynchronous replication, where updates to the primary generate a stream of updates to the secondaries, which is processed after transactions on the primary commit. For database systems, the stream of updates is often a log. The log reflects the exact order of the updates that were performed at the primary, so the updates can be applied directly to each secondary as they arrive.
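
As a rough illustration of this asynchronous style, the following Python sketch models a primary that appends committed updates to an in-memory log and a secondary that applies them strictly in log order. It is a minimal sketch, not any product’s mechanism; the record format, class names, and ship function are invented for the example.

    from collections import deque

    class Primary:
        def __init__(self):
            self.db = {}
            self.log = deque()          # stream of committed updates
            self.next_lsn = 0

        def commit_update(self, key, value):
            # The transaction commits at the primary first; the log record
            # is shipped to the secondary later, asynchronously.
            self.db[key] = value
            self.log.append((self.next_lsn, key, value))
            self.next_lsn += 1

    class Secondary:
        def __init__(self):
            self.db = {}
            self.applied_lsn = -1       # highest log sequence number applied

        def apply(self, record):
            lsn, key, value = record
            if lsn <= self.applied_lsn:
                return                  # already applied: keeps apply idempotent
            self.db[key] = value
            self.applied_lsn = lsn

    def ship(primary, secondary):
        # Drain whatever has accumulated since the last shipment, in order.
        while primary.log:
            secondary.apply(primary.log.popleft())

    p, s = Primary(), Secondary()
    p.commit_update("x", 1)
    p.commit_update("y", 2)
    ship(p, s)                          # the secondary catches up after the commits
    assert s.db == p.db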

One application of primary-copy replication is database mirroring, where there are only two replicas, one primary and one secondary. This is a hot backup technique for high availability. Among the first general-purpose systems to do this were IBM’s IMS/XRF and Tandem’s (now HP’s) Non-Stop SQL database systems. Now most database systems offer a database mirroring feature.

With database mirroring, the primary sends its database log to the secondary. The secondary continually runs log-based recovery, so that its database state is very close to that of the primary. If the primary fails, the secondary just needs to finish processing the tail of the log it received before the primary failed, after which it can take over as primary. If synchronous replication is used, then no transactions are lost in this failover. With asynchronous replication, the secondary may not have received the last few updates from the primary. This problem can be mitigated if the primary and secondary are colocated and the secondary can be given access to the primary copy’s disk log. In that case, after the secondary is given control, it can read the disk log to pick up the last few log records that it did not receive before the primary failed.

Another application of primary-copy replication is to produce queryable copies of parts of a database. This is a functionally rich feature that is offered by most relational database products. In this case, there is no real-time requirement to move the log records immediately to the secondary copy, so there is time to postprocess updates from the primary copy in various ways.

Some relational database systems capture updates to each primary table in a log table that is colocated with the primary table (see Figure 9.8b). One approach is to have the system define an SQL trigger on each primary table that translates each update into an insert on the log table. Periodically, the primary creates a new log table to capture updates and sends the previous log table to each secondary, where it is applied to the replica. This approach to capturing updates can slow down normal processing of transactions, due to the extra work introduced by the trigger.

Figure 9.8. Generating Update Streams for Replicas. An update stream for replicas can be produced from (a) the database system’s log or (b) a log table produced by triggers.


Another approach is to postprocess the database log to create the stream of updates to the replicas. If this is done on-line while the log is still in main memory, then it avoids slowing down normal processing of transactions as compared to the trigger approach. In fact, the log can be sent to a different server, where the postprocessing is done.

Since the log can be quite large, one reason to postprocess the log is to reduce its size if possible. One technique is to filter out aborted transactions, since they do not need to be applied to replicas (see Figure 9.8a). This reduces the amount of data transmission and the cost of processing updates at the replica. However, it requires that the primary not send a log record until it knows that the transaction that wrote the record has committed. This introduces additional processing time at the primary and delay in updating the secondary, which are the main costs of reducing the data transmission. Another technique is to send only the finest granularity data that has changed, e.g., fields of records, rather than coarser-grain units of data, such as entire records.
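
A minimal Python sketch of this kind of filtering (the record layouts and names are invented for the example) holds each transaction’s records back until its commit record appears and drops the records of aborted transactions:

    def filter_committed(log_records):
        # Yield the update records of committed transactions only, releasing
        # each transaction's records once its commit record has been seen.
        pending = {}                                # txid -> buffered updates
        for rec in log_records:
            kind, txid = rec[0], rec[1]
            if kind == "update":                    # ("update", txid, key, value)
                pending.setdefault(txid, []).append(rec)
            elif kind == "commit":                  # ("commit", txid)
                for update in pending.pop(txid, []):
                    yield update                    # safe to send now
            elif kind == "abort":                   # ("abort", txid)
                pending.pop(txid, None)             # never sent to replicas

    log = [("update", 1, "x", 10), ("update", 2, "y", 20),
           ("abort", 2), ("commit", 1)]
    assert list(filter_committed(log)) == [("update", 1, "x", 10)]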

Another reason to postprocess the log is to group together the updates of each transaction. This is beneficial because it enables transactions to be applied to secondaries serially. After each transaction is applied, the secondary database is in a consistent state. Therefore a query can read a consistent database state without using read locks. If updates were not grouped by transaction, then in order for queries to read a consistent state, updates would have to set write locks and queries would have to set read locks.
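
For instance, a secondary that receives updates already grouped by transaction can apply each group as a unit, so that a local query issued between two groups sees a transaction-consistent state without acquiring read locks. A small sketch, with invented names:

    def apply_transaction(db, updates):
        # Install all of one transaction's updates before returning, so the
        # replica moves from one consistent state directly to the next.
        for _, _, key, value in updates:            # ("update", txid, key, value)
            db[key] = value

    replica = {"checking": 100, "savings": 0}
    transfer = [("update", 7, "checking", 50), ("update", 7, "savings", 50)]
    apply_transaction(replica, transfer)
    # A query run here, between transactions, sees both updates or neither.
    assert replica["checking"] + replica["savings"] == 100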

Some systems allow different parts of the database to be replicated to different locations. For example, the primary might contain a table describing customers and other tables describing orders. These tables are colocated at the primary, since many transactions require access to all of these tables. However, the customer table may be replicated at different servers than the order tables. To enable this, the log postprocessing splits each transaction’s updates into two transactions, one for the customer table and one for the order tables, and adds them to separate streams, one for the customer replicas and one for the order replicas. If there is also a replica that contains all the customer and order information, then the log postprocessor would generate a third stream for that replica, with all the updates of each transaction packaged in a single transaction in the stream.
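
A sketch of that splitting step might look as follows; the table-to-stream routing and the stream names are assumptions made for the example, not a description of any product:

    routing = {"customer": "customer_stream", "orders": "order_stream"}

    def split_transaction(updates):
        # Route each update to the stream for its table; a replica holding
        # the full database gets all of the transaction's updates together.
        streams = {"full_copy_stream": list(updates)}
        for table, key, new_values in updates:
            streams.setdefault(routing[table], []).append((table, key, new_values))
        return streams

    txn = [("customer", 42, {"name": "Ann"}), ("orders", 7, {"qty": 3})]
    per_replica = split_transaction(txn)
    assert len(per_replica["customer_stream"]) == 1
    assert len(per_replica["order_stream"]) == 1
    assert len(per_replica["full_copy_stream"]) == 2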

Given this complex filtering and transaction splitting, often a “buffer database” is used to store updates that flow from the primary to secondaries. Updates that are destined for different replicas are stored in different areas of the buffer database. This allows them to be applied to replicas according to different schedules.

Some systems allow application-specific logic to be used to apply changes to replicas. For example, the application could add a timestamp that tells exactly when the update was applied to the replica.

Although primary-copy replication normally does not allow transactions to update a secondary before updating the primary, there are situations where it can be made to work. For example, consider an update transaction that reads and writes a replica using two-phase locking. Suppose it keeps a copy of all the values that it read, which includes all the data items that the transaction wrote. When it is ready to commit, it sends the values of data items that it read along with values that it wrote to the primary. Executing within the context of the same transaction, the primary reads the same data items that the transaction read at the secondary, setting locks as in normal two-phase locking. If the values of the data items that it reads at the primary are the same as those that the transaction read at the secondary, then the transaction applies its updates to the primary too, and commits. If not, then it aborts. This is essentially an application of the optimistic concurrency control technique described in Section 6.8.
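
A minimal sketch of this validation step at the primary follows. Locking is omitted (in practice the primary re-reads under two-phase locking), and the function and variable names are invented for the example:

    def validate_and_apply(primary_db, read_set, write_set):
        # read_set:  {key: value the transaction observed at the secondary}
        # write_set: {key: new value}.  Returns True if the commit succeeds.
        for key, observed in read_set.items():
            if primary_db.get(key) != observed:
                return False                  # a read was stale: abort
        primary_db.update(write_set)          # validation passed: apply the writes
        return True

    primary = {"x": 1, "y": 2}
    # The transaction read x=1 and y=2 at the secondary and wrote y=3.
    assert validate_and_apply(primary, {"x": 1, "y": 2}, {"y": 3}) is True
    # A later transaction that still believes y=2 now fails validation.
    assert validate_and_apply(primary, {"x": 1, "y": 2}, {"y": 4}) is False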

Most database systems offer considerable flexibility in configuring replication. Subsets of tables can be independently replicated, possibly at different locations. For example, a central office’s Accounts table can be split by branch, and the accounts for each branch are replicated at the system at that branch. As the number of replicated tables grows, it can be rather daunting to keep track of which pieces of which tables are replicated at which systems. To simplify management tasks, systems offer tools for displaying, querying, and editing the configuration of replicas.

The replication services of most database systems work by constructing a log stream or log table of updates and sending it to secondary servers. This approach was introduced in Tandem’s (now HP’s) Non-Stop SQL and in Digital’s VAX Data Distributor in the 1980s. Similar approaches are currently offered by IBM, Informix (now IBM), Microsoft (SQL Server), MySQL, Oracle, and Sybase. Within this general approach, products vary in the specific features they offer: the granularity of data that can be replicated (a database, a table, a portion of a table); the flexibility of selecting primaries and secondaries (a server can be a primary server for some data and a secondary for others); how dynamically the configuration of primaries and secondaries can be changed; the options for filtering updates and splitting transactions; and facilities to simplify managing a large set of replicas.

Failures and Recoveries

This primary-copy approach works well as long as the primary and secondaries are alive. How do we handle failures? Let us work through the cases.

Secondary Recovery

If a secondary replica fails, the rest of the system continues to run as before. When the secondary recovers, it needs to catch up processing the stream of updates from the primary. This is not much different than the processing it would have done if it had not failed; it’s just processing the updates later. The main new problem is that it must determine which updates it processed before it failed, so it doesn’t incorrectly reapply non-idempotent updates. This is the same problem as log-based database recovery that we studied in Chapter 7.

If a secondary is down for too long, it may be more efficient to get a whole new copy of the database rather than processing an update stream. In this case, while the database is being copied from the primary to the recovering secondary, more updates are generated at the primary. So after the database has been copied, to finish up, the secondary needs to process that last stream of updates coming from the primary. This is similar to media recovery, as described in Chapter 7.

Primary Recovery with One Secondary

If the primary fails, recovery can be more challenging. One could simply disallow updates until the primary recovers. This is a satisfactory approach when the main goal of replication is better performance for queries. In fact, it may be hard to avoid this approach if complex filtering and partitioning of updates is supported. Since different secondaries receive a different subset of the changes that were applied to the primary, secondaries are often not complete copies of the primary. Therefore, it would be difficult to determine which secondaries should take over as primary for which parts of the primary’s data.

If a goal of replication is improved availability for updates, then it is usually not satisfactory to wait for the primary to recover, since the primary could be down for a while. So if it is important to keep the system running, some secondary must take over as primary. This leads to two technical problems. First, all replicas must agree on the selection of the new primary, since the system cannot tolerate having two primaries; this would lead to total confusion and incorrect results. Second, the last few updates from the failed primary may not have reached all replicas. If a replica starts processing updates from the new primary before it has received all updates from the failed primary, it will end up in a different state than other replicas that did receive all the failed primary’s updates.

We first explore these problems in a simple case of two replicas, one primary and one secondary. Suppose the secondary detects that the primary has failed. This failure detection must be based on timeouts. For example, the secondary is no longer receiving log records from the primary. And when the secondary sends “are you there?” messages to the primary, the secondary receives no reply from the primary. However, these timeouts may be due to a communications failure between the primary and secondary, similar to the one shown in Figure 9.7, and the primary may still be operating.

To distinguish between a primary failure and a primary-secondary communications failure, an external agent is needed to decide which replica should be primary. A typical approach is to add a “watchdog” process, preferably on a different machine than the primary and secondary. The watchdog sends periodic “are you there?” messages to both the primary and secondary. There are four cases to consider (see Figure 9.9); a code sketch of the resulting decision rule appears after the list:

  1. If the watchdog can communicate with the primary and not with the secondary, then it tells the primary of this fact. If the primary can communicate with the secondary, then no action is needed. If not, then the primary creates another secondary, if possible.

  2. If the watchdog can communicate with both the primary and secondary, but they cannot communicate with each other, then they notify the watchdog of this fact. The watchdog then tells the secondary to fail, since it can no longer function as a replica. It also tells the primary to create another secondary, if possible.

  3. If the watchdog can communicate only with the secondary, then it tells the secondary that it believes the primary is down. If the secondary can communicate with the primary, then no action is needed. If not, then it can take over as primary. In this case, if the primary is still operational but is simply unable to communicate with the watchdog, then the primary must self-destruct. Otherwise, the old primary and the new primary (which was formerly the secondary) are both operating as primary. It may therefore be advisable that the watchdog send a message to tell the primary to self-destruct, in case the primary is able to receive messages from the watchdog but its replies are not getting through. In summary, if the secondary loses communications with the primary, then whichever replica can still communicate with the watchdog is now the primary.

  4. If neither replica can communicate with the watchdog or with each other, then neither replica can operate as the primary. This is called a total failure.
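
The four cases reduce to a simple decision rule. The following Python sketch states it as a pure function of who can reach whom; the message exchange, timeouts, and self-destruct mechanism are omitted, and the parameter names are invented for the example:

    def watchdog_decision(sees_primary, sees_secondary, replicas_connected):
        # Case 1: the watchdog reaches only the primary.
        if sees_primary and not sees_secondary:
            return ("no action" if replicas_connected
                    else "primary creates another secondary, if possible")
        # Case 2: the watchdog reaches both, but they cannot reach each other.
        if sees_primary and sees_secondary and not replicas_connected:
            return "secondary is told to fail; primary creates another secondary"
        # Case 3: the watchdog reaches only the secondary.
        if not sees_primary and sees_secondary:
            return ("no action" if replicas_connected
                    else "secondary takes over as primary; old primary must self-destruct")
        # Case 4: the watchdog reaches neither replica.
        if not sees_primary and not sees_secondary:
            return "total failure: neither replica may act as primary"
        return "no action"                    # everyone can communicate

    print(watchdog_decision(False, True, False))   # case 3: failover to the secondary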

Figure 9.9. Failure Cases with a Watchdog. A watchdog process can help sort out failure cases between a primary and secondary.


Suppose that the primary did indeed fail and the secondary has been designated to be the new primary. Now we face the second problem: the new primary may not have received all the committed updates performed by the former primary before the former primary failed. One solution to this problem is to have the primary delay committing a transaction’s updates until it knows that the secondary received those updates. The primary could wait until the secondary has stored those updates in stable storage, or it could wait only until the secondary has received the updates in main memory. If the system that stores the secondary has battery backup, then the latter might be reliable enough. In either case, we’re back to synchronous replication, where the updates to the replica are included as part of the transaction. This extra round of commit-time messages between the primary and secondary is essentially a simple two-phase commit protocol. The performance degradation from these messages can be significant. The choice between performance (asynchronous replication) and reliability (synchronous replication) depends on the application and system configuration. Therefore, database products that offer database mirroring usually offer both options, so the user can choose on a case-by-case basis.
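
The synchronous option can be sketched as a toy model in which the stable-storage write and the acknowledgment are simulated by a single method call; all names here are invented for the example:

    class MirrorSecondary:
        def __init__(self):
            self.durable = []               # stands in for stable storage

        def receive_and_harden(self, updates):
            self.durable.extend(updates)    # record the updates before acknowledging
            return True                     # acknowledgment back to the primary

    def commit_with_mirror(primary_db, secondary, updates):
        # The primary does not commit until the secondary acknowledges that it
        # holds the transaction's updates; if no acknowledgment arrives, the
        # commit is blocked (or the configuration must be changed).
        if not secondary.receive_and_harden(updates):
            raise RuntimeError("secondary did not acknowledge: cannot commit")
        primary_db.update(dict(updates))
        return "committed"

    primary, mirror = {}, MirrorSecondary()
    assert commit_with_mirror(primary, mirror, [("x", 1)]) == "committed"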

Primary Recovery with Multiple Secondaries

Now let’s look at the more general case where there are multiple secondaries and a secondary detects the failure of a primary. There are two ways this could happen. The primary might indeed be down. Or, much worse, there could be a communication failure that partitions the network into independent sets of functioning replicas. In the latter case the primary could still be operational, so the set of replicas that doesn’t include the primary must not promote one of the secondaries to be a primary. The same problem can arise even in the first case where the primary is down. In this case, we do want to promote one of the secondaries to become primary. But if there are two independent sets of replicas that are operating, each set might independently promote a secondary to be the primary, a situation that we want to avoid.

To solve this problem, we need a decision criterion by which at most one set of replicas has an operational primary. We need an algorithm by which replicas can reach this decision. And after the replicas have chosen a new primary, we need an algorithm by which they can recover to the latest state before accepting new transactions. The next three sections treat each of these problems in turn.

Majority and Quorum Consensus

One simple way to ensure that only one primary exists is to statically declare one replica to be the primary. If the network partitions, the partition that has the primary is the one that can process updates. This is a feasible approach, but it is useless if the goal is high availability. If the primary is down, each partition has to assume the worst, which is that the primary really is running but not communicating with this partition. Thus, neither partition promotes a secondary to become primary.

A more flexible algorithm for determining which partition can have the primary is called majority consensus: a set of replicas is allowed to have a primary if and only if the set includes a majority of the replicas (see Figure 9.10). Since a majority is more than half, only one set of replicas can have a majority. Moreover, each partition can independently figure out if it has a majority. These are the two critical properties of majorities that make the technique work.

Figure 9.10. Majority Consensus. Partition 1 has a majority of the replicas and therefore is allowed to process updates. Partition 2 may not process updates.


Majority consensus is a generalization of the watchdog technique we described for database mirroring. The watchdog adds a third process to the mix. Two communicating processes comprise a majority. Thus, whichever partition has at least two communicating processes is allowed to have the primary: either the existing primary and secondary if the watchdog is down, or the watchdog plus whichever replica(s) it can communicate with. By convention, if the watchdog can communicate with the primary and secondary but the latter cannot communicate with each other, then the secondary is told to fail.

Majority consensus does have one annoying problem: it does not work well when there is an even number of copies. In particular, it is useless when there are just two replicas, since the only majority of two is two; that is, it can operate only when both replicas are available. When there are four replicas, a majority needs at least three, so if the network splits into two groups of two copies, neither group can have a primary.

A fancier approach is the quorum consensus algorithm. It gives a weight to each replica and looks for a set of replicas with a majority of the weight, called a quorum (see Figure 9.11). For example, with two replicas, we could give a weight of two to the more reliable replica and a weight of one to the other. That way, the replica with a weight of two can be primary even if the other replica is unavailable. Giving a weight of two to the most reliable replica helps whenever there is an even number of replicas. If the network partitions into two groups with the same number of copies, the group with the replica of weight two still has a quorum.
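
Both rules are easy to state in code. In the sketch below (the replica names and weights are chosen to match Figure 9.11’s total weight of 7), a partition may host the primary only if its replicas’ weights sum to more than half of the total weight; majority consensus is the special case where every weight is 1:

    weights = {"R1": 2, "R2": 1, "R3": 1, "R4": 1, "R5": 1, "R6": 1}  # total weight 7

    def has_quorum(partition, weights):
        total = sum(weights.values())
        return sum(weights[r] for r in partition) > total / 2

    assert has_quorum({"R1", "R2", "R3"}, weights) is True    # weight 4 of 7: a quorum
    assert has_quorum({"R4", "R5", "R6"}, weights) is False   # weight 3 of 7: no quorum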

Figure 9.11. Quorum Consensus. Partition 1 has a total weight of 4, which is more than half of the total weight of 7. It therefore constitutes a quorum and is allowed to process updates.


Reaching Consensus

During normal operation, the set of operational replicas must agree on which replicas are up and which are down or unreachable. If a replica loses communication with one or more other replicas, then the operational replicas need to reassess whether they still have a majority. (For the purpose of this discussion, we’ll assume majority consensus, not quorum consensus.) In fact, the nonoperational replicas that are up also need to do this when they reestablish communications with a replica, since this replica may be the one they need to reach a majority. After some group of replicas is established as having a majority, that group can choose a primary and ensure that all replicas in the group have the most up-to-date state.

To discuss the details, we need some terminology: The replica set is the set of all replicas, including those that are up and down; the current configuration is the set of operational replicas that are able to communicate with each other and comprise a majority.

An algorithm that enables a set of processes to reach a common decision is called a consensus algorithm. In this case, that common decision is agreement on the current configuration by a set of operational replicas. Given our problem context, we’ll call the participants replicas instead of processes. But the algorithm we describe works for general consensus, not just for deciding on the current configuration.

One problem with such consensus algorithms is that multiple replicas may be trying to drive a common decision at the same time. It’s important that different replicas don’t drive the replicas toward different decisions.

Another problem is that the system may be unstable, with replicas and communications links failing and recovering while replicas are trying to reach consensus. There’s not much hope of reaching consensus during such unstable periods. However, once the system stabilizes, we do want the algorithm to reach consensus quickly.

There are several variations of algorithms to reach consensus, but they all have a common theme, namely, that there’s a unique identifier associated with the consensus, that these identifiers are totally ordered, and that the highest unique identifier wins. We will call that identifier an epoch number. It identifies a period of time, called an epoch, during which a set of replicas have agreed on the current configuration, called an epoch set. An epoch number can be constructed by concatenating a counter value with the unique replica identifier of the replica that generated the epoch number. Each replica keeps track of the current epoch number e in stable storage.

During stable periods, the epoch set with the largest epoch number is the current configuration. During unstable periods, the actual current configuration may differ from the current epoch set. The goal of the consensus algorithm is to reach agreement on a new epoch set with an associated epoch number that accurately describes the current configuration.

Suppose a replica R is part of the current configuration, which has epoch number e1. If R detects that the current configuration is no longer valid (because R has detected a failure or recovery), R becomes the leader of a new execution of the consensus algorithm, which proceeds as follows (a code sketch of one round appears after the steps):

  1. R generates a new epoch number e2 that is bigger than e1. For example, it increments the counter value part of e1 by one and concatenates it with R’s replica identifier.

  2. R sends an invitation message containing the value e2 to all replicas in the replica set.

  3. When a replica R′ receives the invitation, it replies to R with an accept message if R′ has not accepted another invitation with an epoch number bigger than e2. R′ includes its current epoch number in the accept message. Moreover, if R′ was the leader of another instance of the consensus algorithm (which is using a smaller epoch number), it stops that execution. Otherwise, if R′ has accepted an invitation with an epoch number bigger than e2, it sends a reject message to R. As a courtesy, it may return the largest epoch number of any invitation it has previously accepted.

  4. R waits for its timeout period to expire, to ensure it receives as many replies as possible.

    1. If R receives accept messages from at least one less than a majority of replicas in the replica set, then it has established a majority (including itself) and therefore has reached consensus. It therefore sends a new epoch message to all the accepting replicas and stops. The new epoch message contains the new epoch number and epoch set. When a replica receives a new epoch message, it updates its epoch number and the associated list of replicas in the epoch set and writes it to stable storage.

    2. Otherwise, R has failed to reach a majority and stops.
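
The following Python sketch walks through one round of these steps. Message passing is simulated by direct method calls, epoch numbers are (counter, replica id) pairs so that they are totally ordered and unique, and the stable-storage writes and timeout handling are omitted; all names are invented for the example:

    class Replica:
        def __init__(self, rid):
            self.rid = rid
            self.promised = (0, "")           # highest epoch number accepted so far
            self.epoch = (0, "")              # epoch number of the current epoch set
            self.epoch_set = set()

        def receive_invitation(self, epoch_number):      # step 3
            if epoch_number > self.promised:
                self.promised = epoch_number
                return True                   # accept the invitation
            return False                      # reject: already saw a bigger epoch number

        def receive_new_epoch(self, epoch_number, epoch_set):
            self.epoch, self.epoch_set = epoch_number, epoch_set

    def run_round(leader, reachable, replica_set_size):
        new_epoch = (leader.epoch[0] + 1, leader.rid)     # step 1
        accepters = [r for r in reachable
                     if r.receive_invitation(new_epoch)]  # steps 2 and 3
        if len(accepters) + 1 > replica_set_size // 2:    # step 4a: majority, counting the leader
            epoch_set = {leader.rid} | {r.rid for r in accepters}
            for r in [leader] + accepters:
                r.receive_new_epoch(new_epoch, epoch_set)
            return epoch_set
        return None                                       # step 4b: no majority

    r1, r2, r3 = Replica("R1"), Replica("R2"), Replica("R3")
    print(run_round(r1, [r2, r3], replica_set_size=3))    # the new epoch set: R1, R2, R3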

Let’s consider the execution of this algorithm under several scenarios. First, assume that only one leader R is running this algorithm. Then it will either receive enough accept messages to establish a majority and hence a new epoch set, or it will fail to reach a majority.

Suppose a leader R1 fails to establish an epoch set. One reason this could happen is that R1 may be unable to communicate with enough replicas to establish a majority. In this case, R1 could periodically attempt to re-execute the algorithm, in case a replica or communication link has silently recovered and thereby made it possible for R1 to form a majority.

A second reason that R1 may fail to establish an epoch set is that another replica R2 is concurrently trying to create a new epoch set using a higher epoch number. In this case, it is important that R1 not rerun the algorithm right away with a larger epoch number, since this might kill R2’s chance of getting a majority of acceptances. That is, it might turn into an “arms race,” where each replica reruns the algorithm with successively higher epoch numbers and thereby causes the other replica’s consensus algorithm to fail.

The arms race problem notwithstanding, if R1 fails to establish an epoch set and, after waiting awhile, receives no other invitations to join an epoch set with a higher epoch number, then it may choose to start another round of the consensus algorithm. If it received a reject message with a higher epoch number e3 in the previous round, then it can increase its chances of reaching consensus by using an epoch number even higher than e3. This ensures that any replica that is still waiting for the result of the execution with epoch number e3 will abandon waiting and choose the new, higher epoch number instead.

Establishing the Latest State

After the current configuration has been established as the epoch set, the primary needs to be selected and all the replicas in the current configuration have to be brought up to date. The first step is to determine if the new epoch set includes the primary from the previous epoch. To do this, first observe that since every epoch set has a majority, it overlaps every earlier epoch set. Therefore, there is at least one replica in the new epoch set from the previous epoch set, and it has the largest epoch number less than the new epoch number. If one of the replicas with the largest previous epoch number was the primary of that epoch set, then we can simplify recovery by reusing it as the primary of the new epoch set.

Unfortunately, this may not be right, because the last epoch may not have stabilized before it lost its majority and had to reconfigure. If that happened, then the primary of the previous epoch set may not have the latest state. That is, the previous epoch set may have elected the primary but not yet refreshed the new primary’s state to be the latest state known to all replicas in the epoch set. To avoid this outcome, the new epoch set needs to identify the last stable epoch set. This can be done by having each epoch use a state bit that it sets after it has stabilized and ensured that every replica in the replica set has the latest state. Only then can the epoch set accept new work.

Therefore, the new epoch set should determine if it includes the primary of the last stable epoch set. If so, then it knows that this primary has the most up-to-date state. So to resume normal processing, the primary needs to ensure the secondaries are up to date by determining the state of each secondary and sending it whatever updates it is missing. It then sets the epoch’s state bit to stable and broadcasts that to all secondaries.

If the epoch set does not include the primary from the previous epoch, then a new primary must be selected. The choice of primary may be based on the amount of spare capacity on its machine (since a primary consumes more resources than a secondary) and on whether it was a member of the most recent epoch set and thus has the latest or a very recent state (so that it can recover quickly and start accepting new requests).

The latest state of the secondaries that are still alive can be determined by comparing the sequence numbers of the last message received by each secondary from the previous primary. The one with the highest sequence number has the latest state and can forward the tail of its update sequence to other secondaries that need it. After a replica receives that state, it acknowledges that fact to the primary. After the new primary receives acknowledgments from all replicas in the epoch set, it can set the epoch’s state to stable and start processing new transactions. The new primary should then start off with a message sequence number greater than the largest received by any secondary in the previous epoch.
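
A small sketch of this catch-up step follows. It assumes, for simplicity, that sequence numbers are consecutive starting at 0 and that each replica retains the updates it has received; the function and replica names are invented for the example:

    def bring_up_to_date(replicas):
        # replicas: {name: [(seqno, update), ...]} in sequence-number order.
        # The most up-to-date replica forwards the missing tail to the rest;
        # the new primary numbers its first message above the old maximum.
        freshest = max(replicas, key=lambda name: len(replicas[name]))
        tail = replicas[freshest]
        for updates in replicas.values():
            updates.extend(tail[len(updates):])       # forward whatever is missing
        return freshest, len(tail)                    # next sequence number to use

    replicas = {"S2": [(0, "u0")], "S3": [(0, "u0"), (1, "u1")], "S4": []}
    freshest, next_seqno = bring_up_to_date(replicas)
    assert freshest == "S3" and next_seqno == 2
    assert all(len(updates) == 2 for updates in replicas.values())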

Does the new epoch set actually have the latest state? To answer this question, let C be the set of replicas that were in both the previous epoch and the new one. Since each epoch set has a majority of the replicas, C must include at least one replica. The replicas in C are the only ones that might know the latest state. However, as in the case of database mirroring, it’s possible that none of them actually knows the latest state, due to the delay in propagating updates from the primary to the replicas. For example, suppose the epoch set for epoch 1 had replicas P, S1, and S2, with P as the primary. Suppose the last transaction was committed by P and S1, but not S2. Then they all died, and epoch set 2 was formed, consisting of replicas S2, S3, and S4. Epoch sets 1 and 2 overlap by one replica, S2, but S2 doesn’t have the latest state.

We encountered this problem when considering secondary recovery for database mirroring. The solution we offered was to propagate updates synchronously. In that case, two-phase commit is needed. This ensures that every replica in C has all the committed updates. However, the last few updates might be in their uncertainty periods and hence blocked. Thus, while synchronous replication reduces the number of transactions that might be lost when a secondary takes over as primary, it doesn’t close the gap entirely.

Consistency, Availability, and Partition-Tolerance

In distributed systems, there is an inherent tradeoff between data consistency, system availability, and tolerance to network partitions. A system can offer any two of these three properties, but not all three of them. This is known as the CAP conjecture.

The primary-copy approach with synchronous replication ensures data consistency and partition-tolerance, and therefore gives up on availability in some cases. It attains data consistency by writing updates to replicas as part of the transaction that performed the write and using two-phase commit for transaction atomicity. It attains partition-tolerance by using quorum consensus to ensure that there are not two partitions that are both able to run transactions. This leads to a loss of availability in the case where the network partitions, because some operational replicas are not part of the quorum. Therefore, even though they are up and running, they are not available.

Suppose the network partitions and the partition that has a quorum of replicas does not include the former primary. Although the system can ensure that updates are permitted only on the quorum of copies, it cannot guarantee consistency, because the last few transactions that executed at the former primary may not have arrived at any of the replicas in the quorum before the network partition occurred. Thus, a decision to allow updates to the quorum of replicas is trading off consistency for availability. A decision to disallow updates to the quorum of replicas is making the opposite tradeoff, namely, trading off availability in order to ensure consistency.

Another aspect of this tradeoff is eventual consistency versus instantaneous consistency. Asynchronous replication ensures eventual consistency but gives up on instantaneous consistency, since there may be a long delay before updates are propagated to some replicas. The weaker level of consistency improves performance by avoiding two-phase commit. It may also improve availability, by allowing a user to be redirected from one replica to another in a slightly different state.

For example, suppose an on-line shopper has started populating her shopping basket, and the shopping basket is replicated using primary-copy replication. Suppose the primary fails and is not present in the quorum. To maximize availability, the system could service read requests using another replica while the replicas are being brought up to date. Thus, during this period, the shopper might be given an older state of her shopping basket. This may occur even if the last update to the basket is known to the quorum, because the shopper’s reads are being serviced by a slightly out-of-date replica. This may be confusing, especially since the shopping basket will return to the latest state after the replicas are brought up to date. However, if the probability of this occurrence is sufficiently low, this loss of consistency may be regarded as a better tradeoff than having the shopping basket be unavailable while the replicas are being brought up to date.

A different set of tradeoffs between consistency, availability, and partition-tolerance is offered by multimaster replication. We will consider these tradeoffs at the end of the next section.