[Bernstein09] 9.5. Multimaster Replication

来源:百度文库 编辑:神马文学网 时间:2024/04/30 23:28:31

9.5. Multimaster Replication

Partitioned Operation Can Be Useful

Ratherthan being the result of a communication failure, a partition issometimes a planned event that happens frequently. For example, alaptop computer might be connected to the network only periodically. Itcould contain areplica of a database, whose primary resides on a reliable server. Whenthe laptop computer is disconnected, it might still be important thatit process updates. For example, consider a laptop that contains asales database and is used by a sales person. Its database might have acustomer table (rarely updated), an orders table (insert-mostly), and asales-call table (append-only). Even when the laptop is disconnectedfrom the network, the sales person must be able to create an order andchange basic customer information, such as address and phone number. Inthis case, it is not satisfactory to require that only the partitionwith a quorum of replicas be operative. Indeed, if there are many salespeople, there probably is no partition with a quorum of replicas, yetall sales people need to be allowed to update their replicas.

Update Propagation with Multiple Masters

Despitethe partition, we could try using the same primary-copy scheme as inthe previous section, but allow update transactions to execute at anyreplica. Each replica logs its updates, as if it were a primary copy.When a replica R reconnects to the network, it sends its logged updatesto the real primary that resides on a reliable server, which canprocess the updates and forward them to other replicas. R can also askthe primary for updates that occurred while it was disconnected.

One problem with this scheme is conflicting updates that originate at different replicas. For example, in Figure 9.12, each transaction executes at a replica. Its updates are applied first to the replica where it executes: transaction T1 updates x at replica R1 and T2 updates x at replica R2.Each replica sends its update to the primary, which forwards it to theother replica. In the end, these conflicting updates are applied indifferent orders by different replicas, so the resulting replicas arenot identical.

Figure 9.12. Conflicting Updates Originating at Different Replicas. The updates to x are applied in different orders by replicas R1 and R2, so the resulting replicas are not identical.


Oneway to avoid this problem is to design the applications so that mostupdates do not conflict. Such updates can be applied in differentorders by different replicas and still produce the same final state atall replicas. For example, in the sales database, a sales personappends a row to the sales-call table every time the sales personinteracts with a customer. This row is unique for each suchinteraction, so two rows generated by different sales people cannotconflict; they may refer to the same customer, but they describedifferent interactions. The orders table is also insert-mostly. Eachinsertion of a new order produces a new row in the order table. Withcareful design,these insertions do not conflict. For example, to ensure insertions donot conflict, the insertion of a new order must not require readingprevious orders; for example, to check that the new order is not aduplicate.

Ifconflicts can occur, then there are several problems to be solved: (1)detecting the conflicts, (2) resolving the conflicts in the same way atall replicas, and (3) ensuring that replicas eventually converge to beidentical. One approach to these problems is to tag each update with aunique timestamp. Unique timestamps can be constructed by concatenatingthe replica’s local clock time with its unique replica identifier, sotimestamps generated at different replicas cannot be identical. Eachdata item at a replica also is tagged with a timestamp. Updates areapplied using Thomas’ Write Rule as follows [Thomas 79] (see Figure 9.13): If an update to data item x arrives at a replica, and the update’s timestamp is larger than x’s timestamp at the replica, then the update is applied and x’s timestamp is replaced by the update’s timestamp. Otherwise, the update is discarded.

Figure 9.13. Thomas’ Write Rule. An update to a data item x is applied only if its timestamp is larger than the one in the database.


Thomas’Write Rule addresses these three problems. It detects a conflict whenan update to a data item arrives at a replica with an update timestamplower than the replica’s timestamp on that data item. It resolves theconflict by retaining the value that has the larger timestamp. Andeventually, each data item x has the same value at all replicas, because at every replica, the update to x with the largest timestamp is the last one that actually was applied.

Thedeletion of a data item needs to be handled like any other update, toensure it is not reinserted by an update with a smaller timestamp thatarrives after the deletion was processed. That is, a deleted data itemmust still be known to the replica and have a timestamp that waswritten by the delete operation, but its value is “deleted.” This valueis usually called a tombstone.

Thomas’Write Rule does not require that clocks be exactly synchronized.However, if the clock at one replica is fast, then its updates willhave larger timestamps than updates concurrently generated by otherreplicas. In conflict situations, the update with the larger timestampwins, so the replica with a fast clock has an unfair advantage. Forthis reason, it is beneficial to keep the clocks nearly synchronizedwhen using Thomas’ Write Rule.

Nonblind Updates

Thomas’ Write Rule works fine for blind updates,which are updates that replace the value of a data item with a newvalue that does not depend on any data that the transaction previouslyread. We just saw two examples: recording a sales call and inserting anew order. Another example is storing a customer’s phone number. Inthis case, two updates by different transactions that store a new phonenumber for the same customer conflict, and if they write differentphone numbers then the execution order certainly matters. However,since the updates were submitted concurrently, either execution orderis satisfactory as long as all replicas reflect the same executionorder. Thomas’ Write Rule solves this problem by ensuring that thefinal value at all replicas is the one written by the update with thehigher timestamp.

Ifthe updates are not blind, then Thomas’ Write Rule isn’t a completelysatisfactory solution, because it doesn’t diagnose that one updatedepends on another one. For example, consider Figure 9.14, which is similar to Figure 9.12 except that the transactions increment x instead of performing blind updates to x, and they use Thomas’ Write Rule. Each value of x is represented by a [value, timestamp] pair. Initially, x= 0 at the primary and replicas and the associated timestamp (ts) is 5.Since transaction timestamps are guaranteed to be unique, T1 and T2 update x with different timestamps, namely 6 and 7, respectively. The updates are concurrent in the sense that neither T1 nor T2 reads the value of x written by the other. Since T2 has the larger timestamp, its value is the one that sticks at all three copies. However, the increment operation by T1is lost, which is probably not what is desired. The execution is notone-copy serializable, since a serial execution of the two transactions(in either order) on a one-copy database would produce a final value ofx = 3. The problem is that the nature of the conflict between T1 and T2was not diagnosed. The system simply retained the update with largertimestamp as if both updates were blind, which they were not.

Figure 9.14. Conflicting Nonblind Updates. Thomas’ Write Rule ensures that x = 2 at all replicas, but T1’s update is lost (ts is an abbreviation for timestamp).


With multimaster replication, situations like Figure 9.14are unavoidable. That is, in general it’s possible for two conflictingtransactions to update different copies of the same data itemindependently at different replicas, such that neither transactionreads the other transaction’s updated value.

Theway multimaster replication is used can greatly affect the probabilityof such conflicts. For example, when multimaster replication is used tosupport disconnected operation, such as laptops that are intermittentlyconnected to the network, replicas can run for long periods withoutexchanging updates. The longer a replica executes transactions withoutexchanging its updates with other replicas, the greater the chance thatreconciliation will be needed. That is, if updates are exchangedfrequently, then the chances are better that an update will bepropagated to all replicas before a transaction with a conflictingupdate executes, thereby avoiding the need for reconciliation.

Whensuch a conflict occurs, Thomas’ Write Rule makes a rather arbitrarychoice by applying the one with larger timestamp and discarding theother one. In a variation of the rule, instead of discarding the onewith smaller timestamp, the system can save it. The saved update canthen be examined later by a person who determines whether it needs tobe reconsidered or merged into the primary somehow.

Detecting Replication Conflicts Using Version Vectors

Insteadof requiring manual reconsideration of the saved update in all cases,we want to distinguish between real conflicts where reconciliation isrequired and fake conflicts where a value really should be overwrittenby a later update. We can do this using a technique called versionvectors.

To explain the use of version vectors, we need the concept of version that was introduced in Section 6.6. Recall that a version of a data item x is the value of x written by a particular transaction. That is, each transaction that updates x produces a new version of x.We introduced this notion in the context of multiversion databases,where the database retains all or most of the versions produced bydifferent transactions. Here, we will typically retain only one versionof a data item. However, we nevertheless need to refer to differentversions of a data item because after a data item is updated, differentreplicas will store different versions during the period that theupdate is being propagated between replicas.

To distinguish between real and fake conflicts, we need precise definitions of them. Given two versions xi and xk of x, we say that xiprecedesxk (written xi → xk) if there is a sequence of transactions, each of which updates x, such that

  • The first transaction in the sequence reads and overwrites xi.

  • Starting with the second transaction in the sequence, each transaction reads and overwrites the version of x produced by the previous transaction in the sequence.

  • The last transaction in the sequence produces version xk.

If xi does not precede xk and xk does not precede xi then we say that xi and xk have a replication conflict (i.e., a “real” conflict). If the database started with one version of x, then the presence of a replication conflict implies there is some version of xthat was overwritten independently by two transactions that did not seeeach other’s output. This is the essence of the kind of conflictexhibited by transactions T1 and T2 in Figure 9.14. If two versions of xare related by the precedes relation, then it’s a fake conflict, sincethe later version clearly should replace the earlier one.

Theversion vector technique enables us to detect replication conflictsbetween versions. It requires that each replica maintain an updatecount. When a transaction updates a data item x for the first time, it associates its local replica ID and current update count with this new version of x,and the update count for the replica is incremented by one. The pair[replica id, update count] uniquely identifies the version and iscalled a version ID. For example, if the current update count for replica R is 8 and the replica runs a transaction T that first updates data item x and then updates y, then T associates version ID [R, 8] with x and [R, 9] with y. If T updates x or y a second time, the version IDs of x and yassociated with T needn’t be changed. Since each replica has a uniquereplica ID, each version ID is unique too. By convention, we use theversion ID of a data item to uniquely identify the final value (not anyintermediate values) written by a transaction into that data item.

To track which versions each replica R has received, R maintains an array of version IDs, called a version vector,with one entry in the array for each replica. Each entry in the versionvector tells which updates R has received and processed from everyother replica. Entry [Ri, c] in the version vector says that R has received all updates generated by Ri with version IDs [Ri, 1], [Ri, 2],..., [Ri, c]. R’s version vector includes an entry for its own latest version ID.

Inaddition to maintaining a version vector for each replica, we alsomaintain a version ID and version vector for each data item at thereplica. This per-data-item information is used to detect replicationconflicts. The version ID and version vector for a data item x is updated when a transaction executing at replica R updates x. After a transaction T executing at R updates a data item x for the first time, it replaces x’s version ID by R’s version ID, it replaces R’s position in x’sversion vector by R’s current version ID, and R increments its versionID. This records the fact that T generated a new version of x at replica R. For example, if T executes at replica R1, R1’s current update count is 14, and T updates x, then T replaces x’s version ID with [R1, 14], T replaces the version ID formerly in x’s version vector, say [R1, c], by [R1, 14], and R increments its update count. It must be that c < 14, because [R1, 14] is the largest version ID that R1 has generated so far.

When replica S (the sender) sends its updated version v of xto another replica R (the receiver), it includes the version ID andversion vector along with the value. The version vector tells R whichupdates were applied to x before v was generated. This gives R enough information to determine whether R’s own version of x has a replication conflict with v, and hence whether it should replace its own version by v.

To see how conflict detection works, suppose that the updated version of xthat S sends to R has version ID [S, 10] and version vector [[S,10],[R, 4]]. Suppose that when R receives the updated version, its value ofx has versionID [R, 5] and version vector [[S, 9], [R, 5]]. In this case, we have areplication conflict. How can R tell? The version vector [[S,10], [R,4]] sent by S tells R that S did not receive R’s updated version [R, 5]before S executed the update that wrote version [S, 10]. On the otherhand, R’s version vector [[S, 9], [R, 5]] for xtells R that R did not receive S’s updated version [S, 10] before itexecuted update [R, 5]. Since neither S nor R saw the other replica’supdated version of x before it executed its latest update, R deduces that there is a replication conflict.

Wewill explain some general approaches to detecting replication conflictsshortly. But first, we describe two other mechanisms that are needed ina complete system, namely, conflict resolution and update propagation.

Conflict Resolution

In the previous example, a simple way for R to deal with the conflict is for it to retain both values of x—theone it already has and the one it just received from S, with theassociated version IDs and version vectors. A later conflict resolutionprocess can determine how to reconcile these two values. This isnecessarily an application-specific process because it depends onknowing (or assuming) something about the semantics of the transactionsthat conflicted. For example, if the conflict resolver knows thattransactions originating at R increment x by one as opposed to doing a blind write of x, then it can add one to the value of x produced by the other transaction to generate a new final value.

Thereplication system can help a little bit by allowing an application toregister one or more merge procedures for each data item, which canthen be invoked automatically when multiple values are stored for thatitem. Alternatively, the two versions of x can be retained and given to the next transaction that reads x, which then has to determine the correct value of x. In any case, the solution is a matter of application programming.

Ifthe resolution executes as a transaction, then its result propagates toother replicas as a normal updated version. If it propagates fastenough, this avoids the need to execute the resolution procedure atother replicas.

Maintaining the Version Vector

Nowlet us see how to propagate recently written versions and maintain theper-replica version vector. Suppose that when a replica S sends updatedversions to a replica R, S sends all updated versions of all data itemsthat it hasn’t previously sent to R. In that case, at the end of theupdate transfer from S to R, if R has received an updated version withversion ID [S, 10] from S, then R knows it has received all updatedversions from S with version ID [S, c] for c ≤ 10.

Sneeds to send to R not only the updated versions from transactions thatS executed since the last time it synchronized with R, but also updatedversions that S received from other replicas, such as R′. R may nothave received those updated versions and indeed may never have theopportunity to synchronize with R′. For example, S may be a server thatis always on the network, and R and R′ are portable machines ondifferent continents that are rarely if ever connected to the networkat the same time.

Thislogic about S’s version IDs sent by S to R applies to R′ too. That is,if S sends an updated version to R with version ID [R′, 5], then at theend of the transfer from S to R, R must also have received all of R′’supdated versions with version IDs less than 5. This observation holdseven if S received R′’s updated versions indirectly via other replicas,because every replica along the path from R′ to S sent all the updatedversions it received earlier from other replicas. That is, R′ sent allof its updated versions with version IDs less than or equal to [R′, 5]to replica R″, which sent them to R′′′, and so on, until they reachedS, which in turn sent them to R. We can summarize this argument as thefollowing invariant.

Version ID invariant: If replica R received an updated version with version ID [Ri, c] and there are no transfers to R in progress, then R received all updated versions generated by Ri with version ID [Ri, c′] for all c′ ≤ c.

Theversion ID invariant implies that R can use a single version ID tosummarize which updates it has received from each replica. For example,R can use version ID [R′, 5] to summarize the fact that R has receivedall updated versions generated by replica R′ with version IDs [R′, 1]through [R′, 5]. By doing this for all replicas, R is maintaining aversion vector.

Therefore,after R has processed all the updates it received from S, R shouldmerge S’s version vector with its own, thereby reflecting the fact thatR’s state now includes all updates that are known to S. This involvesusing the maximum count for each entry in the two version vectors. Forexample, if the Ri entry in the version vectors for S and R are [Ri, c] and [Ri, c′], respectively, then R should replace c′ by the maximum of c and c′.

Sneed not send updated versions that it knows were overwritten. Forexample, if S executed two or more transactions that updated data item x, S needs to send to R only its last updated version of x. There is no point in sending the earlier updated versions to R because they will be overwritten by later updated versions to x sent by S to R.

Given this observation, we have to modify our earlier explanation of the meaning of version vectors. We said that an entry [Ri, c] in the version vector for replica R means that R has received all updates generated by Ri with version IDs [Ri, 1], [Ri, 2],..., [Ri,c]. However, this isn’t true if overwritten versions are notpropagated. Hence, we have to weaken the statement to say that entry [Ri, c] in R’s version vector is the largest version ID of any update that was generated by Ri and received by R. Moreover, R’s state is the same as if it had received all the versions in the sequence [Ri, 1],..., [Ri, c].

Whentwo replicas decide to exchange updates, each one needs to figure outwhich updates to send to the other replica. Version ID’s are helpfulfor this purpose. When replica R wants to receive recent updates fromreplica S, R sends its current version vector to S. Now S runs a queryagainst its local database that retrieves every version whose versionID is greater than the corresponding version ID in R’s version vectorand sends these updates to R. For example, if R’s version vector has anentry [R1, 10], then S should send all data items whose version ID is [R1, b] where b > 10.

Version Vector Update Rules

Asa prelude to presenting update rules based on version vectors, we needa few more definitions. To simplify things a bit, let us assume thatthere are n replicas named R1 through Rn. This enables us to represent a version vector by a sequence of n integers, where entry i in the vector is the count for replica Ri. For example, a version vector [[R1, 7], [R2, 3], [R3, 10], [R4, 7]] would be represented by [7, 3, 10, 7].

We say that a version vector V dominates another version vector V′ if the elements of V and V′ cover the same set of replica ID’s and for every index position i, V[i] ≥ V′[i].For example, if there are three replicas, then version vector [3, 4, 4]dominates [2, 4, 3] and [3, 4, 3]. If V ≠ V′ and neither dominates theother, then we say V and V′ are incomparable. For example, version vector [3, 4, 4] is incomparable to [2, 4, 5] and [2, 4].

Now that we have the complete picture, let’s look at the rules for applying updates, which we call the version vector-based update rules. First, let us recall the rule for running a transaction:

VV1. Suppose a transaction T executes at replica R, which has update counter value c, and T updates data item x. Then T replaces x’s version ID by [R, c], it sets the R position in x’s version vector to c, and it increments R’s update counter by 1.

Suppose a version of x moves from replica S to replica R. More precisely, suppose R receives an updated version xs of some data item x from replica S, where

  • The updated version xs has version ID [Ri, c] and version vector VS = [s1,..., sn], and

  • R’s stored version xr of x has version ID [Rk, d] and version vector VR = [r1,..., rn].

Note that Ri and Rk may be different from both R and S. R processes the update from S as follows:

VV2. If VR dominates VS, then R discards the updated version xs sent by S. In effect, this says that xr should overwrite xs, but since xr arrived at R before xs, R simply discards xs.

VV3. If VS dominates VR, then R replaces its version xr by xs, along with version ID [Ri, c] and version vector [s1,..., sn].

VV4. If VR and VSare incomparable, then there’s a conflict and conflict resolution isneeded. If R resolves the conflict, then the version that R generatesby its conflict resolution procedure has a version vector that’s themerge of the version vectors of the conflicting versions (i.e., takingthe maximum count for each entry in the two version vectors), exceptfor the position corresponding to R, which has the version ID of thenew version generated by R.

Asexplained in the previous section, after R has processed all theupdates it received from S, R should merge S’s version vector with itsown.

The goal of the rules is to ensure that if a version x2 overwrites another version x1, then x1 → x2. Clearly, if a transaction executes according to VV1 or VV4 and overwrites x1 by x2, then x1 → x2. The more interesting cases are VV2 and VV3.

In VV2 and VV3, the decisions are governed by version vector dominance. Consider VV2. If VR dominates VS, then every vector position of VR is greater than or equal to the corresponding position of VS.The only way to create a version vector is to execute a transactionusing VV1 or VV4. Each transaction modifies a version vector byincreasing one of the elements in the vector. Therefore, the only waythat VR can come into existence is that starting with VS, there must be a sequence of transactions that generates successive version vectors where each one updates the version of x generated by the previous one and where the last transaction in the sequence generates VR. By definition of replication conflict, there are no replication conflicts in this sequence. Therefore, if VR dominates VS, then xs → xr, so xr should be retained and xs should be discarded. This is exactly what rule VV2 does. A symmetric argument holds for VV3.

The remaining possibility is VV4, namely that VR does not dominate VS and VS does not dominate VR. In that case VR and VS are incomparable. Thus, there must be elements ra, rb in VR and sa, sb in VS such that ra > sa and sb > rb. Since ra > sa, the transaction that produced version [Ra, ra] was in the sequence of transactions that produced VR but not in the sequence that produced VS, so xr does not precede xs. Similarly, since sb > rb, the transaction that produced version [Rb, rb] was in the sequence of transactions that produced VS but was not in the sequence that produced VR, so xs does not precede xr. Thus the versions tagged by VR and VS exhibit a replication conflict.

Simplified Version Vector Update Rules

Insteadof comparing version vectors, it is actually enough to compare versionIDs of data items to version vectors of replicas, rather than comparingversion vectors for dominance. That is, the rules VV2 and VV3 aremodified to the following:

VI2. If ri≥ c, then R discards S’s updated version.

VI3. If sk≥ d, then R replaces its version of x by the one sent by S, along with version ID [Ri, c] and version vector [s1,..., sn].

As in the previous section, the goal of these rules is to ensure that if a version x2 overwrites another version x1 then x1 → x2. Since VV1 and VV4 are unchanged, they ensure this goal as before.

Thesimplest correctness argument we know of that covers rules VI2 and VI3involves some fairly subtle reasoning. We provide a proof here.However, you can skip the rest of this section without loss ofcontinuity in understanding the rest of the chapter.

We say that a version ID v = [Ri, c] is in a version vector V if the Ri position of V is greater than or equal to c. Notice that the test in VI2 of ri≥ c is testing whether version ID [Ri, c] is in version vector [r1,..., rn]. Similarly, the test in VI3 of sk≥ d is testing whether version ID [Rk, d] is in version vector [s1,..., sn].

Suppose we are given two versions x1 and x2 of x that have version IDs v1 and v2 and version vectors V1 and V2, respectively. We want to show that if VI2 or VI3 overwrites x1 with x2, then x1 → x2. By the observation of the previous paragraph, this is equivalent to showing that if v1 is in V2, then x1 → x2.

First, we restate the definition of → recursively as follows: Given versions x1 and x2 of x, x1 → x2 if and only if either there exists a transaction T that overwrote version x1 with x2 or there exists a transaction T and a version x3 such that x1 → x3 and T overwrote x3 with x2.

We say that a version xi is made by the replica that executed the transaction that created xi. We say that xi is made from the version xj that was held by the replica when it executed the transaction that created xi. When a replica R receives a version xi from replica S and R replaces its version of x by xi, we say that ximoved from S to R.

Now we define a total ordering over the versions. For any version held by a replica R, define its ageto be the number of make and move steps that it took to arrive at thisstate. For conflict-resolving makes (according to VV4), let its age beone greater than the maximum age over all conflicting versions it isresolving. Our proof is by induction on version age, but we will haveto strengthen the statement to be proved to make the induction stepwork:

For every version x2 with version ID v2, version vector V2, and age c held by replica R, and for every version x1 ≠ x2 with version ID v1:

  1. If v1 is in V2 then x1 → x2.

  2. If x1 was made by replica R, x1 has age c1, and c1 < c, then x1 → x2. In other words, every version made by a given replica precedes every version that is later held by that replica.

Basis step: If age = 1, x2 must have been created from x1 by a make step, that is, VV1 or VV4. Clearly, these steps ensure x1 → x2, so both (1) and (2) hold.

Induction step: Suppose x2 is held by replica Rk and has age ck and v2 = [Rn, cn]. Version x2 was made from some version x3, possibly resolving a set of conflicting versions, Con. As in the basis step, x3 → x2 and for all versions xc in Con, xc → x2. Observe that V2 is the merge of the version vector of x3 and those of Con, plus with cn in nth position.

To prove (1), there are two cases:

  1. x1 was not made by Rn. The transaction that made v2 only updated position n of its version vector. Therefore, if v1 is in V2, it must be that v1 is in the version vector of x3 or of one of the versions xc of Con. Since all of these versions are younger than x2, by part (1) of the induction hypothesis, x1 → x3 or x1 → xc. Since x3 → x2 and for all versions xc in Con, xc → x2, by transitivity x1 → x2.

  2. x1 itself was made by Rn with an age smaller than cn. Thus, by induction hypothesis part (2), x1 → x3, and by transitivity x1 → x2.

    To prove (2), observe that there are two cases:

  3. x2 was made by Rk (in other words, Rk = Rn). In that case, x3 was held by Rn before it made x2, and so by induction hypothesis part (2), x1 → x3, and we are done by transitivity.

  4. x2 moved to Rk, overwriting what Rk was holding, say x4 with version ID v4. In this case, according to VV1 and VI2, v4 is in V2. Before x2 moved, it was younger than it currently is, so we apply part (1) of the induction hypothesis to conclude that x4 → x2. But by induction hypothesis part (2), x1 → x4, and so we are done by transitivity.

Example Revisited

Using these rules, let’s revisit the example of Figure 9.14 using version vectors instead of timestamps. The result is shown in Figure 9.15. We rename the primary to be replica R3, so we can use the more compact version vector notation. The version of x is now a triple, comprised of a value, a version ID, and a version vector. Initially, x is 0 at the three replicas. It was last written by a transaction running at replica R3 that generated version ID [R3,1] after which its state is summarized by the version vector [1, 1, 1].The sequence of actions is exactly as before. The interesting cases arewhen each site receives the second updated version. In each case, therecipient recognizes that the second updated version’s version ID andversion vector are incomparable to the ones it has stored, so itdetects a replication conflict and ends in a state containing bothconflicting updates. For example, consider replica R2 when it receives R1’s updated version [1, [R1, 2], [2, 1, 1]] from R3. R2’s local version of x is [2, [R2, 2], [1, 2, 1]]. These versions of x satisfy VV4, which means there is a replication conflict: R2’s local version [R2, 2] was written without having seen version [R1, 2], and the version [R1, 2] sent by R1 was written without having seen version [R2, 2].

Figure 9.15. Using Version Vectors to Reconcile Updates. Initially,x has value 0, produced by version [R3,1] in the state characterized byversion vector [1, 1, 1]. Since T1 and T2 produce incomparable versionIDs and version vectors, their updated versions conflict and areretained at all replicas.


Inthis example, it seems like the use of version vectors has only helpeda little bit in handling concurrent updates of different replicas ofthe same data item. It correctly diagnosed the replication conflict butdid not fully resolve it. However, there are other cases where the useof version vectors fully resolves the situation.

For example, consider Figure 9.16, which is a variation of the scenario in Figure 9.15 where R1 sends its update by T1 directly to R2 in addition to sending it to R3. If R1’s updated version arrives at R2 before R2 executes T2, then T2 will overwrite R1’s updated version of x. R3 will recognize this fact when it receives R2’s updated version of x from R2 and will replace R1’s version of x by R2’s updated version. On the other hand, if R1’s updated version arrives at R2after R2 executes T2, then T2 will not overwrite R1’s updated version of x. R2 will still send the updated version to R3, but in this case R3 will recognize it as a conflict. Notice that timestamps alone cannot make this distinction. For example, if T2 is assigned a larger timestamp than T1, then it will always overwrite T1’s updated version at all replicas, whether or not T2 saw T1’s updated version of x at the time it executed.

Figure 9.16. Diagnosing a Replication Conflict. Using version vectors, when R3 receives T2’s update, it can tell whether that update ran before or after R2 received T1’s update.


Althoughversion vectors do identify replication conflicts, they do not ensureone-copy serializability because they detect conflicts only on a singledata item, not across two or more data items. For example, suppose thedatabase has data items x and y that are stored at replicas R1 and R2. Initially, x has the version [3, [R1, 1], [1, 2]] and y has the version [5, [R2, 2], [1, 2]] at both replicas. Suppose transaction T1 at R1 adds x and y and stores the result, 8, into x with version ID [R1, 2]. Suppose transaction T2 does the same thing, but stores the result, 8, into y with version ID [R2,3]. Each of them propagates their updated version to the other replica.According to the version ID-based update rules, each replica will applythe updated version that it receives from the other replica. So thefinal state at both replicas will have x = [8, [R1, 2], [2, 2]] and y = [8, [R2, 2], [1, 3]]. However, if the transactions ran on a one-copy database, either the resulting value of x would be 11 or the resulting value of y would be 13. So the result is not one-copy serializable.

Consistency, Availability, and Partition-Tolerance Revisited

At the end of Section 9.4we introduced the tradeoff between data consistency, systemavailability, and partition-tolerance. We saw that the primary-copyapproach with synchronous replication offers data consistency andpartition-tolerance at the cost of system availability when a partitionoccurs. With asynchronous replication, it improves availability in somecases at the cost of data consistency.

Themultimaster approach with asynchronous replication offers furtherimprovement of availability and partition-tolerance but with decreaseddata consistency. If a replica R is up and running, then R can be reador written. R need not be part of a quorum of replicas. In fact, evenif R is partitioned from the rest of the network, it is available.However, there is a cost in data consistency. First, since there isdelay in propagating updates to replicas, the system is only eventuallyconsistent, not instantaneously consistent. And second, sincetransactions can update other replicas concurrently with transactionsthat update R, there may be replication conflicts. Such conflictsrepresent a loss of data consistency.

Aswe saw, these conflicts can be detected in certain cases, at whichpoint an application-specific conflict resolution procedure can try toreturn the data to a consistent state. Still, a conflict resolutionprocedure may not be able to make the data perfectly consistent, in thesense that it makes the execution one-copy serializable. For example,if replicas are used for flight reservations, and two replicas rantransactions that sold the last seat on a flight, then the best thatthe conflict resolution procedure can do is run a compensation for oneof the ticket holders. This is not a result that would occur in aone-copy system.

Microsoft Sync Framework

Asan example of a multimaster replication system that uses versionvectors, we consider Microsoft Sync Framework, which was introduced in2007. Like the approach described in this section, it generates aversion ID for each update and maintains a version vector for eachreplica. Like most multimaster implementations, it uses a number ofvariations of the basic techniques outlined in this section. Wehighlight two of them here. First, in most cases it does not attach aversion vector to each data item. Instead, it detects replicationconflicts using the replica’s version vector (not the data item’sversion vector) and modified version ID-based update rules. Second, itallows a transfer of updates from one replica to another to beinterrupted and resumed at another time. This requires additionalmodifications to the maintenance and use of version vectors.

Like before, suppose that replica R receives from replica S an updated version of some data item x with version ID [Ri, c] and that R’s version of x has version ID [Rk, d]. At the time R receives the updated version, its current version vector is [r1,..., rn] and replica S’s current version vector is [s1,..., sn]. Then R decides how to process the update using the following modified version ID-based update rules:

MVI1. If ri≥ c, then discard S’s update.

MVI2. If sk≥ d, then replace the value of x at R by the one sent by S, along with version ID [Ri, c].

MVI3. Otherwise, there is a conflict and conflict resolution is needed.

In the case of MVI3, both values of x are retained. The value sent by S is stored with its version ID and with S’s version vector [s1,..., sn].Thus, some data items have per-data-item version vectors. But these arepresumably a small fraction of the data items at the replica. In anapplication where they are a large fraction of the data items, there isa large number of unresolved replication conflicts, which casts doubton the value of using multimaster replication in this application.

If a later updated version arrives at R for x, then when deciding whether to overwrite the stored version, R uses the version vector associated with x rather than R’s version vector. Similarly, if R forwards this updated version of x to another replica R′, R forwards it with the version vector associated with x and R′ uses that version vector when applying the modified version ID-based update rules, not R’s version vector.

Thesurprising fact about the modified version ID-based update rules isthat they have the same effect as the original version ID-based rules.

Thesecond feature of the Microsoft Sync Framework that we discuss is thatit allows the transfer of updated versions from replica S to R to beinterrupted. Ordinarily, that would cause a problem for R, since it mayhave received an updated version of some data item x but not some updates on which the version depended. It avoids this problem through a technique similar to conflict detection.

To be more precise, if R receives an updated version with version ID [Ri, c] for data item x and the modified version vector-based update rule says to apply the updated version (either overwriting x or adding a conflicting version of x), then R’s state for x is more up to date for xthan for other data items. If the transfer is interrupted, R needs toknow that S’s version vector characterizes the state of updates from Sthat R has applied. These are called exceptions. Eventually, possiblyafter several lengthy interruptions, S completes its transfer ofupdates to R. Now R knows that it is not missing any informationearlier than S’s version vector. Therefore, any exceptions that itaccumulated due to updates it received from R and that are notreplication conflicts can be dropped. At that point, S’s version vectorcan be merged with R’s version vector, as in the normal case.

Supposethere is a total order over data items; for example, by name or storageaddress. Then to minimize the number of the exceptions at R, S can sendthe updated versions to R in data item order. If the transfer isinterrupted after R has received items in (say) the range 1 to m,it can summarize its exceptions by the range exception [1, m] and S’sversion vector. Clearly, this is a much denser representation thanenumerating exceptions for each updated data item that was transferred.

Giventhese techniques, a replica has several sources of knowledge about pastupdates: a per-replica version vector and per-data-item version vectorsfor replication conflicts and for exceptions. When a replica R requestsrecent changes from another replica S, R sends all this knowledge ofits current state to S, so that S can avoid sending updates that Ralready knows about.

Thereare several other special techniques used in this system; for example,to garbage collect tombstones, to enable replicas to join and leave asystem, and to allow different conflict handlers at different replicas.See the bibliographic notes for articles that describe these and otherfeatures.