[Bernstein09] Section 7.3. Introduction to Database Recovery

Now that we understand how to recover application processes, it’s time to turn our attention to recovering resource managers. As in Chapter 6, “Locking,” we will use the term data manager instead of the more generic term resource manager. The most popular type of data manager is a database system. However, the principles apply to any kind of transactional resource manager, such as queue managers and transactional file systems.

To recover from a failure, a data manager needs to quickly return its database to a state that includes the results of all transactions that committed before the failure and no results of transactions that aborted before the failure or were active at the time of failure. Most data managers do an excellent job of this type of recovery. The application programmer doesn’t get involved at all.

The mechanisms used to recover from these failures can have a significant effect on performance. However, if a data manager uses a recovery approach that leads to mediocre transaction performance, there is not much that the application programmer can do about it. This is rather different from locking, where application programming and database design can have a big effect. In view of the lack of control that an application programmer has over the situation, there is no strong requirement that he or she have a deep understanding of how a data manager does recovery.

Still, there are a few ways, though not many, that database and system administrators can work together to improve performance, fault tolerance, and the performance of recovery. For example, they can improve the fault tolerance of a system by altering the configuration of logs, disk devices, and the like. To reason about performance and fault tolerance implications of application and system design, it helps a great deal to understand the main concepts behind database recovery algorithms. We describe these concepts, and their implications for application programming, in the rest of this chapter.

Types of Failure

Many failures are due to incorrectly programmed transactions and to data entry errors that lead to incorrect parameters to transactions. Unfortunately, these failures undermine the assumption that a transaction’s execution preserves the consistency of the database (the “C” in ACID). They can be dealt with by applying software engineering techniques to the programming and testing of transactions, by validating input before feeding it to a transaction, and by semantic integrity mechanisms built into the data manager. However they’re dealt with, they are intrinsically outside the range of problems that transaction recovery mechanisms can automatically handle. Since we’re interested in problems that transaction recovery mechanisms can handle, we will assume that transactions do indeed preserve database consistency.

There are three types of failures that are most important to a TP system: transaction failures, system failures, and media failures. A transaction failure occurs when a transaction aborts. A system failure occurs when the contents of volatile storage, namely main memory, are corrupted. For example, this can happen to semiconductor memory when the power fails. It also happens when the operating system fails. Although an operating system failure may not corrupt all of main memory, it is usually too difficult to determine which parts were actually corrupted by the failure. So one generally assumes the worst and reinitializes all of main memory. Given the possibility of system failures, the database itself must be kept on a stable storage medium, such as disk. (Of course, other considerations, such as size, may also force us to store the database on stable mass storage media.) By definition, stable (or nonvolatile) storage withstands system failures. A media failure occurs when any part of the stable storage is destroyed. For instance, this happens if some sectors of a disk become damaged.
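The three failure types differ mainly in which storage they destroy. The following Python sketch makes that distinction concrete under simplifying assumptions: the `Failure` enum, the `Storage` class, and its `apply` method are hypothetical names invented for illustration, and two dictionaries stand in for main memory and disk.

```python
from enum import Enum


class Failure(Enum):
    TRANSACTION = "transaction"  # a transaction aborts
    SYSTEM = "system"            # contents of volatile storage are lost
    MEDIA = "media"              # part of stable storage is destroyed


class Storage:
    """Toy two-level storage (illustrative only, not a real data manager)."""

    def __init__(self):
        self.volatile = {}  # stands in for main memory
        self.stable = {}    # stands in for disk

    def apply(self, failure, damaged_keys=()):
        """Simulate what each failure type destroys."""
        if failure is Failure.SYSTEM:
            # Assume the worst: all of main memory is reinitialized.
            self.volatile.clear()
        elif failure is Failure.MEDIA:
            # Only the damaged portion of stable storage is lost.
            for key in damaged_keys:
                self.stable.pop(key, None)
        # A transaction failure destroys no storage at all; handling it
        # means undoing the aborted transaction's writes.


store = Storage()
store.stable.update({"x": 1, "y": 2})
store.volatile.update({"x": 10})

store.apply(Failure.SYSTEM)
print(store.volatile, store.stable)  # {} {'x': 1, 'y': 2}

store.apply(Failure.MEDIA, damaged_keys=["y"])
print(store.stable)  # {'x': 1}
```

In this toy model, stable storage survives the system failure untouched, which is the property the text relies on when it requires the database to be kept on a stable medium.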

In this chapter we assume that each transaction accesses and updates data at exactly one data manager. This allows us to focus our attention on recovery strategies for a single data manager. In the next chapter we’ll consider additional problems that arise when a transaction can update data at more than one data manager.

Recovery Strategies

The main strategy for recovering from failures is quite simple:

  • Transaction failure: If a transaction aborts, the data manager restores the previous values of all data items that the transaction wrote.

  • System failure: To recover from the failure, the data manager aborts any transactions that were active (i.e., uncommitted) at the time of the failure, and it ensures that each transaction that did commit before the failure actually installed its updates in the database.

  • Media failure: The recovery strategy is nearly the same as for system failures, since the goal is to return the database to a state where it contains the results of all committed transactions and no aborted transactions.
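The transaction-failure strategy above (restore the previous value of every item the transaction wrote) can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of any particular data manager. The `DataManager` class, its method names, and the `_MISSING` sentinel are hypothetical, and the sketch assumes that no two active transactions write the same data item at the same time, for example because locking prevents it.

```python
_MISSING = object()  # sentinel: the item did not exist before the write


class DataManager:
    """Toy in-memory data manager that can undo an aborted transaction."""

    def __init__(self, initial):
        self.db = dict(initial)
        # txn id -> {key: value the item had before the txn's first write}
        self.before_images = {}

    def begin(self, txn):
        self.before_images[txn] = {}

    def write(self, txn, key, value):
        saved = self.before_images[txn]
        if key not in saved:
            # Remember only the value from before the first write.
            saved[key] = self.db.get(key, _MISSING)
        self.db[key] = value

    def commit(self, txn):
        # Once committed, the previous values are no longer needed.
        del self.before_images[txn]

    def abort(self, txn):
        # Restore the previous value of every item the transaction wrote.
        for key, old in self.before_images.pop(txn).items():
            if old is _MISSING:
                self.db.pop(key, None)
            else:
                self.db[key] = old


dm = DataManager({"a": 100, "b": 50})
dm.begin("T1")
dm.write("T1", "a", 70)
dm.write("T1", "b", 80)
dm.write("T1", "a", 60)   # second write to "a"; before-image stays 100
dm.abort("T1")
print(dm.db)  # {'a': 100, 'b': 50}
```

Saving only the value from before a transaction's first write to an item is enough here, because an abort must return the item to its state before the transaction began, not to any intermediate value.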

We will concentrate on recovery from transaction and system failures for most of the chapter. Recovery from media failures is quite similar to recovery from system failures, so we’ll postpone discussing it until the end of the chapter, after we have a complete picture of system recovery mechanisms.

It’s easy to see why system and media recovery are so similar. Each recovery mechanism considers a certain part of storage to be unreliable: main memory, in the case of system failures; a portion of stable storage, in the case of media failures. To safeguard against the loss of data in unreliable storage, the recovery mechanism maintains another copy of the data, possibly in a different representation. This redundant copy is kept in another part of storage that it deems reliable: stable storage, in the case of system failures, or another piece of stable storage, such as a second disk or tape, in the case of media failures. Of course, the different physical characteristics of storage in the two cases may require the use of different strategies. But the principles are the same.

The most popular technique for recovering from system and media failures is logging. The log is that second, redundant copy of the data that is used to cope with failures. To understand how a log is used, why it works, and how it affects performance, we need to start with a simplified model of data manager internals, so we have a framework in which to discuss the issues.
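To make the idea of the log as a redundant copy concrete, here is a deliberately simplified Python sketch of rebuilding a database from a log after a system failure. It is an illustration under stated assumptions, not the recovery algorithm of any real data manager. The record layout (tuples such as `("write", txn, key, value)` and `("commit", txn)`) and the `recover` function are hypothetical. The sketch also assumes that the log survives on stable storage, that it holds the entire update history starting from an empty database, and that conflicting writes appear in the log in the order they were executed.

```python
def recover(log):
    """Rebuild the database state from a stable log after a system failure.

    Only writes of transactions that have a commit record are reapplied.
    Transactions that aborted, or were still active at the time of the
    failure, contribute nothing to the recovered state.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = {}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, key, value = rec
            db[key] = value  # later committed writes overwrite earlier ones
    return db


log = [
    ("write", "T1", "x", 1),
    ("write", "T2", "y", 5),
    ("commit", "T1"),
    ("write", "T3", "x", 2),
    ("abort", "T2"),
    # system failure happens here: T3 was still active
]
print(recover(log))  # {'x': 1}
```

The recovered state contains the result of T1, which committed, and nothing from T2, which aborted, or T3, which was active at the time of the failure. This matches the goal of recovery stated at the start of the section.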