[Bernstein09] 9.5. Multimaster Replication

来源：百度文库编辑：神马文学网时间：2024/04/30 23:28:31

9.5. Multimaster Replication

Partitioned Operation Can Be Useful

Ratherthan being the result of a communication failure, a partition issometimes a planned event that happens frequently. For example, alaptop computer might be connected to the network only periodically. Itcould contain areplica of a database, whose primary resides on a reliable server. Whenthe laptop computer is disconnected, it might still be important thatit process updates. For example, consider a laptop that contains asales database and is used by a sales person. Its database might have acustomer table (rarely updated), an orders table (insert-mostly), and asales-call table (append-only). Even when the laptop is disconnectedfrom the network, the sales person must be able to create an order andchange basic customer information, such as address and phone number. Inthis case, it is not satisfactory to require that only the partitionwith a quorum of replicas be operative. Indeed, if there are many salespeople, there probably is no partition with a quorum of replicas, yetall sales people need to be allowed to update their replicas.

Update Propagation with Multiple Masters

Despitethe partition, we could try using the same primary-copy scheme as inthe previous section, but allow update transactions to execute at anyreplica. Each replica logs its updates, as if it were a primary copy.When a replica R reconnects to the network, it sends its logged updatesto the real primary that resides on a reliable server, which canprocess the updates and forward them to other replicas. R can also askthe primary for updates that occurred while it was disconnected.

One problem with this scheme is conflicting updates that originate at different replicas. For example, in Figure 9.12, each transaction executes at a replica. Its updates are applied first to the replica where it executes: transaction T₁ updates x at replica R₁ and T₂ updates x at replica R₂.Each replica sends its update to the primary, which forwards it to theother replica. In the end, these conflicting updates are applied indifferent orders by different replicas, so the resulting replicas arenot identical.

Figure 9.12. Conflicting Updates Originating at Different Replicas. The updates to x are applied in different orders by replicas R₁ and R₂, so the resulting replicas are not identical.

Oneway to avoid this problem is to design the applications so that mostupdates do not conflict. Such updates can be applied in differentorders by different replicas and still produce the same final state atall replicas. For example, in the sales database, a sales personappends a row to the sales-call table every time the sales personinteracts with a customer. This row is unique for each suchinteraction, so two rows generated by different sales people cannotconflict; they may refer to the same customer, but they describedifferent interactions. The orders table is also insert-mostly. Eachinsertion of a new order produces a new row in the order table. Withcareful design,these insertions do not conflict. For example, to ensure insertions donot conflict, the insertion of a new order must not require readingprevious orders; for example, to check that the new order is not aduplicate.

Ifconflicts can occur, then there are several problems to be solved: (1)detecting the conflicts, (2) resolving the conflicts in the same way atall replicas, and (3) ensuring that replicas eventually converge to beidentical. One approach to these problems is to tag each update with aunique timestamp. Unique timestamps can be constructed by concatenatingthe replica’s local clock time with its unique replica identifier, sotimestamps generated at different replicas cannot be identical. Eachdata item at a replica also is tagged with a timestamp. Updates areapplied using Thomas’ Write Rule as follows [Thomas 79] (see Figure 9.13): If an update to data item x arrives at a replica, and the update’s timestamp is larger than x’s timestamp at the replica, then the update is applied and x’s timestamp is replaced by the update’s timestamp. Otherwise, the update is discarded.

Figure 9.13. Thomas’ Write Rule. An update to a data item x is applied only if its timestamp is larger than the one in the database.

Thomas’Write Rule addresses these three problems. It detects a conflict whenan update to a data item arrives at a replica with an update timestamplower than the replica’s timestamp on that data item. It resolves theconflict by retaining the value that has the larger timestamp. Andeventually, each data item x has the same value at all replicas, because at every replica, the update to x with the largest timestamp is the last one that actually was applied.

Thedeletion of a data item needs to be handled like any other update, toensure it is not reinserted by an update with a smaller timestamp thatarrives after the deletion was processed. That is, a deleted data itemmust still be known to the replica and have a timestamp that waswritten by the delete operation, but its value is “deleted.” This valueis usually called a tombstone.

Thomas’Write Rule does not require that clocks be exactly synchronized.However, if the clock at one replica is fast, then its updates willhave larger timestamps than updates concurrently generated by otherreplicas. In conflict situations, the update with the larger timestampwins, so the replica with a fast clock has an unfair advantage. Forthis reason, it is beneficial to keep the clocks nearly synchronizedwhen using Thomas’ Write Rule.

Nonblind Updates

Thomas’ Write Rule works fine for blind updates,which are updates that replace the value of a data item with a newvalue that does not depend on any data that the transaction previouslyread. We just saw two examples: recording a sales call and inserting anew order. Another example is storing a customer’s phone number. Inthis case, two updates by different transactions that store a new phonenumber for the same customer conflict, and if they write differentphone numbers then the execution order certainly matters. However,since the updates were submitted concurrently, either execution orderis satisfactory as long as all replicas reflect the same executionorder. Thomas’ Write Rule solves this problem by ensuring that thefinal value at all replicas is the one written by the update with thehigher timestamp.

Ifthe updates are not blind, then Thomas’ Write Rule isn’t a completelysatisfactory solution, because it doesn’t diagnose that one updatedepends on another one. For example, consider Figure 9.14, which is similar to Figure 9.12 except that the transactions increment x instead of performing blind updates to x, and they use Thomas’ Write Rule. Each value of x is represented by a [value, timestamp] pair. Initially, x= 0 at the primary and replicas and the associated timestamp (ts) is 5.Since transaction timestamps are guaranteed to be unique, T₁ and T₂ update x with different timestamps, namely 6 and 7, respectively. The updates are concurrent in the sense that neither T₁ nor T₂ reads the value of x written by the other. Since T₂ has the larger timestamp, its value is the one that sticks at all three copies. However, the increment operation by T₁is lost, which is probably not what is desired. The execution is notone-copy serializable, since a serial execution of the two transactions(in either order) on a one-copy database would produce a final value ofx = 3. The problem is that the nature of the conflict between T₁ and T₂was not diagnosed. The system simply retained the update with largertimestamp as if both updates were blind, which they were not.

Figure 9.14. Conflicting Nonblind Updates. Thomas’ Write Rule ensures that x = 2 at all replicas, but T₁’s update is lost (ts is an abbreviation for timestamp).

With multimaster replication, situations like Figure 9.14are unavoidable. That is, in general it’s possible for two conflictingtransactions to update different copies of the same data itemindependently at different replicas, such that neither transactionreads the other transaction’s updated value.

Theway multimaster replication is used can greatly affect the probabilityof such conflicts. For example, when multimaster replication is used tosupport disconnected operation, such as laptops that are intermittentlyconnected to the network, replicas can run for long periods withoutexchanging updates. The longer a replica executes transactions withoutexchanging its updates with other replicas, the greater the chance thatreconciliation will be needed. That is, if updates are exchangedfrequently, then the chances are better that an update will bepropagated to all replicas before a transaction with a conflictingupdate executes, thereby avoiding the need for reconciliation.

Whensuch a conflict occurs, Thomas’ Write Rule makes a rather arbitrarychoice by applying the one with larger timestamp and discarding theother one. In a variation of the rule, instead of discarding the onewith smaller timestamp, the system can save it. The saved update canthen be examined later by a person who determines whether it needs tobe reconsidered or merged into the primary somehow.

Detecting Replication Conflicts Using Version Vectors

Insteadof requiring manual reconsideration of the saved update in all cases,we want to distinguish between real conflicts where reconciliation isrequired and fake conflicts where a value really should be overwrittenby a later update. We can do this using a technique called versionvectors.

To explain the use of version vectors, we need the concept of version that was introduced in Section 6.6. Recall that a version of a data item x is the value of x written by a particular transaction. That is, each transaction that updates x produces a new version of x.We introduced this notion in the context of multiversion databases,where the database retains all or most of the versions produced bydifferent transactions. Here, we will typically retain only one versionof a data item. However, we nevertheless need to refer to differentversions of a data item because after a data item is updated, differentreplicas will store different versions during the period that theupdate is being propagated between replicas.

To distinguish between real and fake conflicts, we need precise definitions of them. Given two versions x_i and x_k of x, we say that x_iprecedesx_k (written x_i → x_k) if there is a sequence of transactions, each of which updates x, such that

The first transaction in the sequence reads and overwrites x_i.
Starting with the second transaction in the sequence, each transaction reads and overwrites the version of x produced by the previous transaction in the sequence.
The last transaction in the sequence produces version x_k.

If x_i does not precede x_k and x_k does not precede x_i then we say that x_i and x_k have a replication conflict (i.e., a “real” conflict). If the database started with one version of x, then the presence of a replication conflict implies there is some version of xthat was overwritten independently by two transactions that did not seeeach other’s output. This is the essence of the kind of conflictexhibited by transactions T₁ and T₂ in Figure 9.14. If two versions of xare related by the precedes relation, then it’s a fake conflict, sincethe later version clearly should replace the earlier one.

Theversion vector technique enables us to detect replication conflictsbetween versions. It requires that each replica maintain an updatecount. When a transaction updates a data item x for the first time, it associates its local replica ID and current update count with this new version of x,and the update count for the replica is incremented by one. The pair[replica id, update count] uniquely identifies the version and iscalled a version ID. For example, if the current update count for replica R is 8 and the replica runs a transaction T that first updates data item x and then updates y, then T associates version ID [R, 8] with x and [R, 9] with y. If T updates x or y a second time, the version IDs of x and yassociated with T needn’t be changed. Since each replica has a uniquereplica ID, each version ID is unique too. By convention, we use theversion ID of a data item to uniquely identify the final value (not anyintermediate values) written by a transaction into that data item.

To track which versions each replica R has received, R maintains an array of version IDs, called a version vector,with one entry in the array for each replica. Each entry in the versionvector tells which updates R has received and processed from everyother replica. Entry [R_i, c] in the version vector says that R has received all updates generated by R_i with version IDs [R_i, 1], [R_i, 2],..., [R_i, c]. R’s version vector includes an entry for its own latest version ID.

Inaddition to maintaining a version vector for each replica, we alsomaintain a version ID and version vector for each data item at thereplica. This per-data-item information is used to detect replicationconflicts. The version ID and version vector for a data item x is updated when a transaction executing at replica R updates x. After a transaction T executing at R updates a data item x for the first time, it replaces x’s version ID by R’s version ID, it replaces R’s position in x’sversion vector by R’s current version ID, and R increments its versionID. This records the fact that T generated a new version of x at replica R. For example, if T executes at replica R₁, R₁’s current update count is 14, and T updates x, then T replaces x’s version ID with [R₁, 14], T replaces the version ID formerly in x’s version vector, say [R₁, c], by [R₁, 14], and R increments its update count. It must be that c < 14, because [R₁, 14] is the largest version ID that R₁ has generated so far.

When replica S (the sender) sends its updated version v of xto another replica R (the receiver), it includes the version ID andversion vector along with the value. The version vector tells R whichupdates were applied to x before v was generated. This gives R enough information to determine whether R’s own version of x has a replication conflict with v, and hence whether it should replace its own version by v.

To see how conflict detection works, suppose that the updated version of xthat S sends to R has version ID [S, 10] and version vector [[S,10],[R, 4]]. Suppose that when R receives the updated version, its value ofx has versionID [R, 5] and version vector [[S, 9], [R, 5]]. In this case, we have areplication conflict. How can R tell? The version vector [[S,10], [R,4]] sent by S tells R that S did not receive R’s updated version [R, 5]before S executed the update that wrote version [S, 10]. On the otherhand, R’s version vector [[S, 9], [R, 5]] for xtells R that R did not receive S’s updated version [S, 10] before itexecuted update [R, 5]. Since neither S nor R saw the other replica’supdated version of x before it executed its latest update, R deduces that there is a replication conflict.

Wewill explain some general approaches to detecting replication conflictsshortly. But first, we describe two other mechanisms that are needed ina complete system, namely, conflict resolution and update propagation.

Conflict Resolution

In the previous example, a simple way for R to deal with the conflict is for it to retain both values of x—theone it already has and the one it just received from S, with theassociated version IDs and version vectors. A later conflict resolutionprocess can determine how to reconcile these two values. This isnecessarily an application-specific process because it depends onknowing (or assuming) something about the semantics of the transactionsthat conflicted. For example, if the conflict resolver knows thattransactions originating at R increment x by one as opposed to doing a blind write of x, then it can add one to the value of x produced by the other transaction to generate a new final value.

Thereplication system can help a little bit by allowing an application toregister one or more merge procedures for each data item, which canthen be invoked automatically when multiple values are stored for thatitem. Alternatively, the two versions of x can be retained and given to the next transaction that reads x, which then has to determine the correct value of x. In any case, the solution is a matter of application programming.

Ifthe resolution executes as a transaction, then its result propagates toother replicas as a normal updated version. If it propagates fastenough, this avoids the need to execute the resolution procedure atother replicas.

Maintaining the Version Vector

Nowlet us see how to propagate recently written versions and maintain theper-replica version vector. Suppose that when a replica S sends updatedversions to a replica R, S sends all updated versions of all data itemsthat it hasn’t previously sent to R. In that case, at the end of theupdate transfer from S to R, if R has received an updated version withversion ID [S, 10] from S, then R knows it has received all updatedversions from S with version ID [S, c] for c ≤ 10.

Sneeds to send to R not only the updated versions from transactions thatS executed since the last time it synchronized with R, but also updatedversions that S received from other replicas, such as R′. R may nothave received those updated versions and indeed may never have theopportunity to synchronize with R′. For example, S may be a server thatis always on the network, and R and R′ are portable machines ondifferent continents that are rarely if ever connected to the networkat the same time.

Thislogic about S’s version IDs sent by S to R applies to R′ too. That is,if S sends an updated version to R with version ID [R′, 5], then at theend of the transfer from S to R, R must also have received all of R′’supdated versions with version IDs less than 5. This observation holdseven if S received R′’s updated versions indirectly via other replicas,because every replica along the path from R′ to S sent all the updatedversions it received earlier from other replicas. That is, R′ sent allof its updated versions with version IDs less than or equal to [R′, 5]to replica R″, which sent them to R′′′, and so on, until they reachedS, which in turn sent them to R. We can summarize this argument as thefollowing invariant.

Version ID invariant: If replica R received an updated version with version ID [R_i, c] and there are no transfers to R in progress, then R received all updated versions generated by R_i with version ID [R_i, c′] for all c′ ≤ c.

Theversion ID invariant implies that R can use a single version ID tosummarize which updates it has received from each replica. For example,R can use version ID [R′, 5] to summarize the fact that R has receivedall updated versions generated by replica R′ with version IDs [R′, 1]through [R′, 5]. By doing this for all replicas, R is maintaining aversion vector.

Therefore,after R has processed all the updates it received from S, R shouldmerge S’s version vector with its own, thereby reflecting the fact thatR’s state now includes all updates that are known to S. This involvesusing the maximum count for each entry in the two version vectors. Forexample, if the R_i entry in the version vectors for S and R are [R_i, c] and [R_i, c′], respectively, then R should replace c′ by the maximum of c and c′.

Sneed not send updated versions that it knows were overwritten. Forexample, if S executed two or more transactions that updated data item x, S needs to send to R only its last updated version of x. There is no point in sending the earlier updated versions to R because they will be overwritten by later updated versions to x sent by S to R.

Given this observation, we have to modify our earlier explanation of the meaning of version vectors. We said that an entry [R_i, c] in the version vector for replica R means that R has received all updates generated by R_i with version IDs [R_i, 1], [R_i, 2],..., [R_i,c]. However, this isn’t true if overwritten versions are notpropagated. Hence, we have to weaken the statement to say that entry [R_i, c] in R’s version vector is the largest version ID of any update that was generated by R_i and received by R. Moreover, R’s state is the same as if it had received all the versions in the sequence [R_i, 1],..., [R_i, c].

Whentwo replicas decide to exchange updates, each one needs to figure outwhich updates to send to the other replica. Version ID’s are helpfulfor this purpose. When replica R wants to receive recent updates fromreplica S, R sends its current version vector to S. Now S runs a queryagainst its local database that retrieves every version whose versionID is greater than the corresponding version ID in R’s version vectorand sends these updates to R. For example, if R’s version vector has anentry [R₁, 10], then S should send all data items whose version ID is [R₁, b] where b > 10.

Version Vector Update Rules

Asa prelude to presenting update rules based on version vectors, we needa few more definitions. To simplify things a bit, let us assume thatthere are n replicas named R₁ through R_n. This enables us to represent a version vector by a sequence of n integers, where entry i in the vector is the count for replica R_i. For example, a version vector [[R₁, 7], [R₂, 3], [R₃, 10], [R₄, 7]] would be represented by [7, 3, 10, 7].

We say that a version vector V dominates another version vector V′ if the elements of V and V′ cover the same set of replica ID’s and for every index position i, V[i] ≥ V′[i].For example, if there are three replicas, then version vector [3, 4, 4]dominates [2, 4, 3] and [3, 4, 3]. If V ≠ V′ and neither dominates theother, then we say V and V′ are incomparable. For example, version vector [3, 4, 4] is incomparable to [2, 4, 5] and [2, 4].

Now that we have the complete picture, let’s look at the rules for applying updates, which we call the version vector-based update rules. First, let us recall the rule for running a transaction:

VV1. Suppose a transaction T executes at replica R, which has update counter value c, and T updates data item x. Then T replaces x’s version ID by [R, c], it sets the R position in x’s version vector to c, and it increments R’s update counter by 1.

Suppose a version of x moves from replica S to replica R. More precisely, suppose R receives an updated version x_s of some data item x from replica S, where

The updated version x_s has version ID [R_i, c] and version vector V_S = [s₁,..., s_n], and
R’s stored version x_r of x has version ID [R_k, d] and version vector V_R = [r₁,..., r_n].

Note that R_i and R_k may be different from both R and S. R processes the update from S as follows:

VV2. If V_R dominates V_S, then R discards the updated version x_s sent by S. In effect, this says that x_r should overwrite x_s, but since x_r arrived at R before x_s, R simply discards x_s.
VV3. If V_S dominates V_R, then R replaces its version x_r by x_s, along with version ID [R_i, c] and version vector [s₁,..., s_n].
VV4. If V_R and V_Sare incomparable, then there’s a conflict and conflict resolution isneeded. If R resolves the conflict, then the version that R generatesby its conflict resolution procedure has a version vector that’s themerge of the version vectors of the conflicting versions (i.e., takingthe maximum count for each entry in the two version vectors), exceptfor the position corresponding to R, which has the version ID of thenew version generated by R.

Asexplained in the previous section, after R has processed all theupdates it received from S, R should merge S’s version vector with itsown.

The goal of the rules is to ensure that if a version x₂ overwrites another version x₁, then x₁ → x₂. Clearly, if a transaction executes according to VV1 or VV4 and overwrites x₁ by x₂, then x₁ → x₂. The more interesting cases are VV2 and VV3.

In VV2 and VV3, the decisions are governed by version vector dominance. Consider VV2. If V_R dominates V_S, then every vector position of V_R is greater than or equal to the corresponding position of V_S.The only way to create a version vector is to execute a transactionusing VV1 or VV4. Each transaction modifies a version vector byincreasing one of the elements in the vector. Therefore, the only waythat V_R can come into existence is that starting with V_S, there must be a sequence of transactions that generates successive version vectors where each one updates the version of x generated by the previous one and where the last transaction in the sequence generates V_R. By definition of replication conflict, there are no replication conflicts in this sequence. Therefore, if V_R dominates V_S, then x_s → x_r, so x_r should be retained and x_s should be discarded. This is exactly what rule VV2 does. A symmetric argument holds for VV3.

The remaining possibility is VV4, namely that V_R does not dominate V_S and V_S does not dominate V_R. In that case V_R and V_S are incomparable. Thus, there must be elements r_a, r_b in V_R and s_a, s_b in V_S such that r_a > s_a and s_b > r_b. Since r_a > s_a, the transaction that produced version [R_a, r_a] was in the sequence of transactions that produced V_R but not in the sequence that produced V_S, so x_r does not precede x_s. Similarly, since s_b > r_b, the transaction that produced version [R_b, r_b] was in the sequence of transactions that produced V_S but was not in the sequence that produced V_R, so x_s does not precede x_r. Thus the versions tagged by V_R and V_S exhibit a replication conflict.

Simplified Version Vector Update Rules

Insteadof comparing version vectors, it is actually enough to compare versionIDs of data items to version vectors of replicas, rather than comparingversion vectors for dominance. That is, the rules VV2 and VV3 aremodified to the following:

VI2. If r_i≥ c, then R discards S’s updated version.
VI3. If s_k≥ d, then R replaces its version of x by the one sent by S, along with version ID [R_i, c] and version vector [s₁,..., s_n].

As in the previous section, the goal of these rules is to ensure that if a version x₂ overwrites another version x₁ then x₁ → x₂. Since VV1 and VV4 are unchanged, they ensure this goal as before.

Thesimplest correctness argument we know of that covers rules VI2 and VI3involves some fairly subtle reasoning. We provide a proof here.However, you can skip the rest of this section without loss ofcontinuity in understanding the rest of the chapter.

We say that a version ID v = [R_i, c] is in a version vector V if the R_i position of V is greater than or equal to c. Notice that the test in VI2 of r_i≥ c is testing whether version ID [R_i, c] is in version vector [r₁,..., r_n]. Similarly, the test in VI3 of s_k≥ d is testing whether version ID [R_k, d] is in version vector [s₁,..., s_n].

Suppose we are given two versions x₁ and x₂ of x that have version IDs v₁ and v₂ and version vectors V₁ and V₂, respectively. We want to show that if VI2 or VI3 overwrites x₁ with x₂, then x₁ → x₂. By the observation of the previous paragraph, this is equivalent to showing that if v₁ is in V₂, then x₁ → x₂.

First, we restate the definition of → recursively as follows: Given versions x₁ and x₂ of x, x₁ → x₂ if and only if either there exists a transaction T that overwrote version x₁ with x₂ or there exists a transaction T and a version x₃ such that x₁ → x₃ and T overwrote x₃ with x₂.

We say that a version x_i is made by the replica that executed the transaction that created x_i. We say that x_i is made from the version x_j that was held by the replica when it executed the transaction that created x_i. When a replica R receives a version x_i from replica S and R replaces its version of x by x_i, we say that x_imoved from S to R.

Now we define a total ordering over the versions. For any version held by a replica R, define its ageto be the number of make and move steps that it took to arrive at thisstate. For conflict-resolving makes (according to VV4), let its age beone greater than the maximum age over all conflicting versions it isresolving. Our proof is by induction on version age, but we will haveto strengthen the statement to be proved to make the induction stepwork:

For every version x₂ with version ID v₂, version vector V₂, and age c held by replica R, and for every version x₁ ≠ x₂ with version ID v₁:

If v₁ is in V₂ then x₁ → x₂.

If x₁ was made by replica R, x₁ has age c₁, and c₁ < c, then x₁ → x₂. In other words, every version made by a given replica precedes every version that is later held by that replica.

Basis step: If age = 1, x₂ must have been created from x₁ by a make step, that is, VV1 or VV4. Clearly, these steps ensure x₁ → x₂, so both (1) and (2) hold.
Induction step: Suppose x₂ is held by replica R_k and has age c_k and v₂ = [R_n, c_n]. Version x₂ was made from some version x₃, possibly resolving a set of conflicting versions, Con. As in the basis step, x₃ → x₂ and for all versions x_c in Con, x_c → x₂. Observe that V₂ is the merge of the version vector of x₃ and those of Con, plus with c_n in n^th position.

To prove (1), there are two cases:

x₁ was not made by R_n. The transaction that made v₂ only updated position n of its version vector. Therefore, if v₁ is in V₂, it must be that v₁ is in the version vector of x₃ or of one of the versions x_c of Con. Since all of these versions are younger than x₂, by part (1) of the induction hypothesis, x₁ → x₃ or x₁ → x_c. Since x₃ → x₂ and for all versions x_c in Con, x_c → x₂, by transitivity x₁ → x₂.
x₁ itself was made by R_n with an age smaller than c_n. Thus, by induction hypothesis part (2), x₁ → x₃, and by transitivity x₁ → x₂.

To prove (2), observe that there are two cases:
x₂ was made by R_k (in other words, R_k = R_n). In that case, x₃ was held by R_n before it made x₂, and so by induction hypothesis part (2), x₁ → x₃, and we are done by transitivity.
x₂ moved to R_k, overwriting what R_k was holding, say x₄ with version ID v₄. In this case, according to VV₁ and VI₂, v₄ is in V₂. Before x₂ moved, it was younger than it currently is, so we apply part (1) of the induction hypothesis to conclude that x₄ → x₂. But by induction hypothesis part (2), x₁ → x₄, and so we are done by transitivity.

Example Revisited

Using these rules, let’s revisit the example of Figure 9.14 using version vectors instead of timestamps. The result is shown in Figure 9.15. We rename the primary to be replica R₃, so we can use the more compact version vector notation. The version of x is now a triple, comprised of a value, a version ID, and a version vector. Initially, x is 0 at the three replicas. It was last written by a transaction running at replica R₃ that generated version ID [R₃,1] after which its state is summarized by the version vector [1, 1, 1].The sequence of actions is exactly as before. The interesting cases arewhen each site receives the second updated version. In each case, therecipient recognizes that the second updated version’s version ID andversion vector are incomparable to the ones it has stored, so itdetects a replication conflict and ends in a state containing bothconflicting updates. For example, consider replica R₂ when it receives R₁’s updated version [1, [R₁, 2], [2, 1, 1]] from R₃. R₂’s local version of x is [2, [R₂, 2], [1, 2, 1]]. These versions of x satisfy VV4, which means there is a replication conflict: R₂’s local version [R₂, 2] was written without having seen version [R₁, 2], and the version [R₁, 2] sent by R₁ was written without having seen version [R₂, 2].

Figure 9.15. Using Version Vectors to Reconcile Updates. Initially,x has value 0, produced by version [R3,1] in the state characterized byversion vector [1, 1, 1]. Since T1 and T2 produce incomparable versionIDs and version vectors, their updated versions conflict and areretained at all replicas.

Inthis example, it seems like the use of version vectors has only helpeda little bit in handling concurrent updates of different replicas ofthe same data item. It correctly diagnosed the replication conflict butdid not fully resolve it. However, there are other cases where the useof version vectors fully resolves the situation.

For example, consider Figure 9.16, which is a variation of the scenario in Figure 9.15 where R₁ sends its update by T₁ directly to R₂ in addition to sending it to R₃. If R₁’s updated version arrives at R₂ before R₂ executes T₂, then T₂ will overwrite R₁’s updated version of x. R₃ will recognize this fact when it receives R₂’s updated version of x from R₂ and will replace R₁’s version of x by R₂’s updated version. On the other hand, if R₁’s updated version arrives at R₂after R₂ executes T₂, then T₂ will not overwrite R₁’s updated version of x. R₂ will still send the updated version to R₃, but in this case R₃ will recognize it as a conflict. Notice that timestamps alone cannot make this distinction. For example, if T₂ is assigned a larger timestamp than T₁, then it will always overwrite T₁’s updated version at all replicas, whether or not T₂ saw T₁’s updated version of x at the time it executed.

Figure 9.16. Diagnosing a Replication Conflict. Using version vectors, when R₃ receives T₂’s update, it can tell whether that update ran before or after R₂ received T₁’s update.

Althoughversion vectors do identify replication conflicts, they do not ensureone-copy serializability because they detect conflicts only on a singledata item, not across two or more data items. For example, suppose thedatabase has data items x and y that are stored at replicas R₁ and R₂. Initially, x has the version [3, [R₁, 1], [1, 2]] and y has the version [5, [R₂, 2], [1, 2]] at both replicas. Suppose transaction T₁ at R₁ adds x and y and stores the result, 8, into x with version ID [R₁, 2]. Suppose transaction T₂ does the same thing, but stores the result, 8, into y with version ID [R₂,3]. Each of them propagates their updated version to the other replica.According to the version ID-based update rules, each replica will applythe updated version that it receives from the other replica. So thefinal state at both replicas will have x = [8, [R₁, 2], [2, 2]] and y = [8, [R₂, 2], [1, 3]]. However, if the transactions ran on a one-copy database, either the resulting value of x would be 11 or the resulting value of y would be 13. So the result is not one-copy serializable.

Consistency, Availability, and Partition-Tolerance Revisited

At the end of Section 9.4we introduced the tradeoff between data consistency, systemavailability, and partition-tolerance. We saw that the primary-copyapproach with synchronous replication offers data consistency andpartition-tolerance at the cost of system availability when a partitionoccurs. With asynchronous replication, it improves availability in somecases at the cost of data consistency.

Themultimaster approach with asynchronous replication offers furtherimprovement of availability and partition-tolerance but with decreaseddata consistency. If a replica R is up and running, then R can be reador written. R need not be part of a quorum of replicas. In fact, evenif R is partitioned from the rest of the network, it is available.However, there is a cost in data consistency. First, since there isdelay in propagating updates to replicas, the system is only eventuallyconsistent, not instantaneously consistent. And second, sincetransactions can update other replicas concurrently with transactionsthat update R, there may be replication conflicts. Such conflictsrepresent a loss of data consistency.

Aswe saw, these conflicts can be detected in certain cases, at whichpoint an application-specific conflict resolution procedure can try toreturn the data to a consistent state. Still, a conflict resolutionprocedure may not be able to make the data perfectly consistent, in thesense that it makes the execution one-copy serializable. For example,if replicas are used for flight reservations, and two replicas rantransactions that sold the last seat on a flight, then the best thatthe conflict resolution procedure can do is run a compensation for oneof the ticket holders. This is not a result that would occur in aone-copy system.

Microsoft Sync Framework

Asan example of a multimaster replication system that uses versionvectors, we consider Microsoft Sync Framework, which was introduced in2007. Like the approach described in this section, it generates aversion ID for each update and maintains a version vector for eachreplica. Like most multimaster implementations, it uses a number ofvariations of the basic techniques outlined in this section. Wehighlight two of them here. First, in most cases it does not attach aversion vector to each data item. Instead, it detects replicationconflicts using the replica’s version vector (not the data item’sversion vector) and modified version ID-based update rules. Second, itallows a transfer of updates from one replica to another to beinterrupted and resumed at another time. This requires additionalmodifications to the maintenance and use of version vectors.

Like before, suppose that replica R receives from replica S an updated version of some data item x with version ID [R_i, c] and that R’s version of x has version ID [R_k, d]. At the time R receives the updated version, its current version vector is [r₁,..., r_n] and replica S’s current version vector is [s₁,..., s_n]. Then R decides how to process the update using the following modified version ID-based update rules:

MVI1. If r_i≥ c, then discard S’s update.
MVI2. If s_k≥ d, then replace the value of x at R by the one sent by S, along with version ID [R_i, c].
MVI3. Otherwise, there is a conflict and conflict resolution is needed.

In the case of MVI3, both values of x are retained. The value sent by S is stored with its version ID and with S’s version vector [s₁,..., s_n].Thus, some data items have per-data-item version vectors. But these arepresumably a small fraction of the data items at the replica. In anapplication where they are a large fraction of the data items, there isa large number of unresolved replication conflicts, which casts doubton the value of using multimaster replication in this application.

If a later updated version arrives at R for x, then when deciding whether to overwrite the stored version, R uses the version vector associated with x rather than R’s version vector. Similarly, if R forwards this updated version of x to another replica R′, R forwards it with the version vector associated with x and R′ uses that version vector when applying the modified version ID-based update rules, not R’s version vector.

Thesurprising fact about the modified version ID-based update rules isthat they have the same effect as the original version ID-based rules.

Thesecond feature of the Microsoft Sync Framework that we discuss is thatit allows the transfer of updated versions from replica S to R to beinterrupted. Ordinarily, that would cause a problem for R, since it mayhave received an updated version of some data item x but not some updates on which the version depended. It avoids this problem through a technique similar to conflict detection.

To be more precise, if R receives an updated version with version ID [R_i, c] for data item x and the modified version vector-based update rule says to apply the updated version (either overwriting x or adding a conflicting version of x), then R’s state for x is more up to date for xthan for other data items. If the transfer is interrupted, R needs toknow that S’s version vector characterizes the state of updates from Sthat R has applied. These are called exceptions. Eventually, possiblyafter several lengthy interruptions, S completes its transfer ofupdates to R. Now R knows that it is not missing any informationearlier than S’s version vector. Therefore, any exceptions that itaccumulated due to updates it received from R and that are notreplication conflicts can be dropped. At that point, S’s version vectorcan be merged with R’s version vector, as in the normal case.

Supposethere is a total order over data items; for example, by name or storageaddress. Then to minimize the number of the exceptions at R, S can sendthe updated versions to R in data item order. If the transfer isinterrupted after R has received items in (say) the range 1 to m,it can summarize its exceptions by the range exception [1, m] and S’sversion vector. Clearly, this is a much denser representation thanenumerating exceptions for each updated data item that was transferred.

Giventhese techniques, a replica has several sources of knowledge about pastupdates: a per-replica version vector and per-data-item version vectorsfor replication conflicts and for exceptions. When a replica R requestsrecent changes from another replica S, R sends all this knowledge ofits current state to S, so that S can avoid sending updates that Ralready knows about.

Thereare several other special techniques used in this system; for example,to garbage collect tombstones, to enable replicas to join and leave asystem, and to allow different conflict handlers at different replicas.See the bibliographic notes for articles that describe these and otherfeatures.

[Bernstein09] 9.5. Multimaster Replication [Bernstein09] Chapter 9. Replication [Bernstein09] 9.6. Other Replication Techniques [Bernstein09] 9.4. Single-Master Primary-Copy Replication ? MySQL Replication(复制)基本原理 [Bernstein09] Bibliography [Bernstein09] Section 3.8. Summary [Bernstein09] Section 4.9. Summary [Bernstein09] Section 5.8. Summary [Bernstein09] 2.2. Transactions [Bernstein09] 2.6. Scalability [Bernstein09] 2.7. Summary [Bernstein09] 10.9. Summary [Bernstein09] 9.2. Replicated Servers [Bernstein09] 9.8. Summary [Bernstein09] Bibliographic Notes [Bernstein09] Section 3.2. Application Architecture [Bernstein09] Section 3.4. Request Controller [Bernstein09] Section 3.5. Transaction Servers [Bernstein09] Section 3.6. Transactional Middleware [Bernstein09] Section 4.3. Client Recovery [Bernstein09] Section 4.6. Publish-Subscribe [Bernstein09] Section 5.4. Transactional Properties [Bernstein09] Chapter 7. System Recovery