[Bernstein09] Section 7.7. Log-based Database Recovery Algorithms

Logging is the most popular technique for implementing a recovery manager. As we described earlier, the log contains a record for each write, commit, and abort operation.

Implementing Commit

To process a commit operation, the recovery manager adds a commit record to the end of the log and flushes the log. The log manager is designed so that it doesn’t acknowledge the flush operation until all the log pages in memory, up to and including the one being flushed, have been written to disk and the disk has acknowledged that the disk writes completed successfully. At this point, the transaction has been committed and the recovery manager can acknowledge this fact to its caller.

Since all the transaction’s update records precede the commit record in the log, by writing the commit record and then flushing the log, the recovery manager ensures that all the transaction’s updates are in stable storage. That is, it ensures that the force-at-commit rule has been satisfied. It doesn’t matter whether any of the updated pages have been flushed to the stable database. The updates are in the log, and the log is in stable storage, which is enough to satisfy the rule.
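
To make the commit path concrete, here is a minimal sketch in Python. The names are ours, not the book’s, and os.fsync stands in for the log manager’s flush-and-wait-for-the-disk-to-acknowledge behavior:

    import os

    class LogManager:
        """Append-only log file; records buffer in memory until flushed."""

        def __init__(self, path):
            self.f = open(path, "ab")

        def append(self, record):
            self.f.write(record + b"\n")   # buffered; not yet durable

        def flush(self):
            self.f.flush()                 # push buffers to the OS
            os.fsync(self.f.fileno())      # wait for the disk to acknowledge

    class RecoveryManager:
        def __init__(self, log):
            self.log = log

        def commit(self, txn_id):
            # The transaction's update records already precede this point in
            # the log, so appending the commit record and flushing satisfies
            # force-at-commit: flush() does not return until every buffered
            # log page, up to and including this one, is on disk.
            self.log.append(b"COMMIT %d" % txn_id)
            self.log.flush()
            # Only now may the commit be acknowledged to the caller.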

Flushing the log to commit a transaction is a potential bottleneck. If the disk that holds the log can do K sequential disk writes per second, then K is the maximum number of transactions per second for the whole system. This is too small a number for high-performance systems. This is especially annoying because the log page normally isn’t full when the flush is invoked, so the full bandwidth of the disk isn’t being used. This observation creates an opportunity to improve performance.
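
To put illustrative numbers on this (ours, not the book’s): if the log disk sustains K = 5,000 sequential writes per second and every commit forces its own flush, the whole system commits at most 5,000 transactions per second, even though each mostly-empty log page could have carried the commit records of many transactions in the same write.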

A popular way to relieve this bottleneck is an optimization called group commit. After adding a commit record to the log, the recovery manager introduces a small artificial delay before flushing the log page, something on the order of 1/K; that is, a few milliseconds. During that period, if there are other transactions running, they can add records to the end of the log: update records, commit records, and abort records. If the system is busy, then the chances are that the log page will fill up during this period, and when the recovery manager reaches the end of the delay period, it will end up flushing a full page. Thus, each flush operation on the log can commit many transactions, and the recovery manager is getting the full value of the disk bandwidth. If the system is not busy, then it doesn’t matter that a partially filled log page is flushed to disk, since not all the disk bandwidth is needed to support the transaction load.
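
Here is a minimal sketch of group commit using Python threads; the 2 millisecond delay and all names are our illustrative choices. Each committer appends its record and blocks; a background flusher wakes once per delay window and does a single fsync that makes every record appended so far durable:

    import os
    import threading
    import time

    class GroupCommitLog:
        def __init__(self, path, delay=0.002):     # delay on the order of 1/K
            self.f = open(path, "ab")
            self.delay = delay
            self.appended = 0                      # records written to the buffer
            self.flushed = 0                       # records known to be on disk
            self.cv = threading.Condition()
            threading.Thread(target=self._flusher, daemon=True).start()

        def commit(self, txn_id):
            with self.cv:
                self.f.write(b"COMMIT %d\n" % txn_id)
                self.appended += 1
                my_seq = self.appended
                while self.flushed < my_seq:       # wait for a covering flush
                    self.cv.wait()
            # Durable: acknowledge the commit to the caller here.

        def _flusher(self):
            while True:
                time.sleep(self.delay)             # the artificial delay
                with self.cv:
                    if self.flushed < self.appended:
                        self.f.flush()
                        os.fsync(self.f.fileno())  # one disk write, many commits
                        self.flushed = self.appended
                        self.cv.notify_all()

A production log manager would let new appends proceed while the fsync is in progress; holding the lock across the fsync, as this sketch does, keeps the bookkeeping simple at the cost of a shorter batching window.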

The group commit optimization is an example of a general-purpose technique called boxcarring. When there is a high fixed overhead per write operation, it pays to pack a lot of data in each operation. Another place this arises is communication systems that have a high fixed cost to send a message independent of the message’s size. The term boxcar is a metaphor for the boxcar in a train, which has a high fixed cost to transport independent of how full it is.

Implementing Abort

To process an abort operation, the recovery manager has to undo the updates of any database pages that were updated by the transaction. It does this by tracing through the transaction’s log records, starting from the last one, and installing the before-image of each page that was updated by the transaction.

Sequentially searching the log for the transaction’s update records is rather inefficient. To avoid this sequential scan, the recovery manager maintains a linked list of all the transaction’s update records in the log. The list header is a transaction descriptor, a data structure that describes each transaction the recovery manager knows about (see Figure 7.16). The descriptor includes a pointer to the last log record written by each transaction. Each update record in the log contains a pointer to the previous update record written by the same transaction. So, starting from the transaction descriptor, all the transaction’s update records can be scanned.

Figure 7.16. Data Structure Supporting Abort Processing. Starting from the transaction descriptor, all the transaction’s update records can be scanned.


Maintaining the list is easy. When a transaction writes an update record to the log, it includes a backpointer to the previous log record for that transaction. Then it updates the transaction descriptor to point to the new update record, which is now the last one for the transaction.
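
The following sketch shows one way this data structure and the abort walk could look; the names are our illustrative choices (the book does not prescribe an implementation), and `log` and `cache` stand for objects that can append a record and overwrite a page, respectively:

    class UpdateRecord:
        """One page write; prev is the backpointer to the same
        transaction's previous update record, or None."""
        def __init__(self, page_id, before_image, after_image, prev):
            self.page_id = page_id
            self.before_image = before_image
            self.after_image = after_image
            self.prev = prev

    class TransactionDescriptor:
        """List header: points at the transaction's last update record."""
        def __init__(self, txn_id):
            self.txn_id = txn_id
            self.last_update = None

        def log_update(self, log, page_id, before, after):
            rec = UpdateRecord(page_id, before, after, self.last_update)
            log.append(rec)           # write the record to the log
            self.last_update = rec    # it is now the transaction's last record

    def abort(txn, cache):
        """Undo newest-first by following backpointers from the descriptor,
        installing each update's before-image."""
        rec = txn.last_update
        while rec is not None:
            cache.write_page(rec.page_id, rec.before_image)
            rec = rec.prev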

There is still the matter of the write-ahead log protocol to consider. The system needs to ensure that it doesn’t flush a dirty page from the cache to the stable database unless all the update records that describe updates to that page by uncommitted transactions have already been flushed to the log. To do this, it needs a little help from the cache manager.

We need to add a field to the cache descriptor of each cache slot. This field points to the log page that we need to worry about to enforce the write-ahead log protocol. That is, it contains the address of the log page that contains the update record describing the last update to this cache slot’s page (see Figure 7.17). Let’s call this the dependent log page address (there’s no standard term for this). Every time a database page P is updated, the dependent log page address of P’s cache slot is also updated to point to the page containing the update’s log record. Before the cache manager flushes a cache slot, it must check that the dependent log page is not in cache and dirty. If it is, then the dependent log page must be flushed first, to ensure the write-ahead log protocol is satisfied.

Figure 7.17. Dependent Log Page Address. Before flushing a page, the cache manager must check that the dependent log page is not in cache and dirty.
    Page   Dirty Bit   Cache Address   Pin Count   Dependent Log Page Address
    P4     1           104             1           1218
    P16    0           376             1           null
    P5     1           400             0           1332
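
In code, the cache manager’s update and flush paths might look like the following sketch; the names and helper objects are illustrative stand-ins:

    class CacheSlot:
        def __init__(self, page_id, data):
            self.page_id = page_id
            self.data = data
            self.dirty = False
            self.pin_count = 0
            self.dependent_log_page = None  # address of the log page holding the
                                            # last update record for this page

    def update_page(slot, new_data, log_page_addr):
        slot.data = new_data
        slot.dirty = True
        # Remember which log page must reach disk before this page may be flushed.
        slot.dependent_log_page = log_page_addr

    def flush_slot(slot, log_cache, disk):
        dep = slot.dependent_log_page
        # Write-ahead log check: the dependent log page goes out first.
        if dep is not None and log_cache.is_dirty(dep):
            log_cache.flush(dep)
        disk.write_page(slot.page_id, slot.data)
        slot.dirty = False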

Although the cache manager has to check the dependent log page address every time it flushes a page from cache, this rarely generates an extra cache flush of the log page. The reason is this: the log is a sequential file. As soon as a log page fills up, the log manager tells the cache manager to flush it. By the time the cache manager decides to flush a database page, the chances are that the database page has been sitting around in cache for a while since it was last updated. For example, the cache replacement algorithm notices that the page hasn’t been accessed recently and therefore decides to replace it. Since the page hasn’t been accessed recently, the chances are that the dependent log page has already been flushed.

As we will see in a moment, even hot pages must eventually be flushed. Since a hot page is updated frequently, it may have update records in the tail of the log. So flushing a hot page may be delayed until its dependent log page has been flushed.

Implementing Restart

To implement restart, the recovery manager scans the log to figure out which transactions need to be aborted and which committed updates need to be redone. There are many restart algorithms of varying complexity in use, so we’ll start with a simple one and optimize it as we go.

All restart algorithms depend on the recovery manager to perform checkpoint operations periodically, which synchronize the state of the log with the state of the stable database. The simplest checkpoint algorithm does the following:

  1. It stops accepting any new update, commit, or abort operations. It waits until all active update, commit, and abort operations have finished.

  2. It makes a list of all active transactions along with each transaction’s pointer to its last log record.

  3. It flushes all the dirty pages in cache.

  4. It writes a checkpoint record to the log, which includes the list of active transactions and log pointers.

  5. It resumes accepting new update, commit, and abort operations.

At this point, the stable database state is exactly consistent with the state of the log. We’ll explain a more efficient checkpointing algorithm in a moment, but for now, let’s assume we’re using this one.
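
As a sketch, the five steps map directly onto code. The `system` object and its attributes are illustrative stand-ins, not an API from the book:

    def checkpoint(system):
        # 1. Stop accepting new update/commit/abort operations and wait
        #    for the active ones to finish.
        system.quiesce()
        # 2. List the active transactions with their last-log-record pointers.
        active = [(t.txn_id, t.last_log_record) for t in system.active_txns]
        # 3. Flush all the dirty pages in cache.
        for slot in system.cache.slots:
            if slot.dirty:
                system.cache.flush(slot)
        # 4. Write the checkpoint record, including the active-transaction list.
        system.log.append(("CHECKPOINT", active))
        system.log.flush()
        # 5. Resume accepting operations.
        system.resume()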

The restart algorithm scans the log forward and fully processes each log record before proceeding to the next. Its goal is first to redo all updates that executed after the last checkpoint and then to undo the ones that did not commit. It starts at the last checkpoint record. There is no point in looking at log records before the last checkpoint record, because their effects have been fully recorded in the stable database (see Figure 7.18). The restart algorithm maintains lists of committed and aborted transactions, which are initially empty, and a list of active transactions, which is initialized from the last checkpoint record. When the restart algorithm encounters a new log record, it does the following:

  • If the log record is an update record, then it writes the after-image of the update to the cache, and it adds the transaction’s identifier to the active list if it isn’t already there. Notice that even if the update is already in the stable database, there is no harm in writing the after-image, because the after-image contains an entire page image. (Remember our simplifying assumption that each update writes a whole page.) So at worst, it’s just redoing work needlessly.

  • If the log record is a commit record, it adds the transaction to its commit list and removes it from the active list.

  • If the log record is an abort record, it undoes all of the transaction’s updates in the same way as it normally processes an abort. Also, it adds the transaction to its abort list and removes it from the active list.

Figure 7.18. Basic Checkpointing. All dirty pages are flushed before a checkpoint record is written.


When it reaches the end of the log, it has redone all the updates of committed and active transactions, and wiped out the effects of any aborted transactions. At this point, the active list contains any transactions that started running before the failure but did not commit or abort before the failure. (Notice that since the active list was initialized from the last checkpoint record, this includes transactions that were active at the last checkpoint but did not subsequently commit or abort.) These transactions cannot continue running, since they lost their memory state during the system failure, so the restart algorithm aborts them too. Now the system is ready to process new transactions, since the combined state of the cache and stable database includes all committed updates and no aborted ones.
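
Putting the scan together, here is a sketch of the restart loop. The record layout and the helpers `last_checkpoint`, `records_after`, and `undo` are our illustrative inventions:

    def restart(log, cache):
        cp = log.last_checkpoint()
        active = set(cp.active_txns)          # initialized from the checkpoint
        committed, aborted = set(), set()

        for rec in log.records_after(cp):     # forward scan
            if rec.kind == "UPDATE":
                # Redo: install the after-image (a whole page, so idempotent).
                cache.write_page(rec.page_id, rec.after_image)
                active.add(rec.txn_id)
            elif rec.kind == "COMMIT":
                committed.add(rec.txn_id)
                active.discard(rec.txn_id)
            elif rec.kind == "ABORT":
                undo(rec.txn_id, log, cache)  # install before-images
                aborted.add(rec.txn_id)
                active.discard(rec.txn_id)

        # Transactions still on the active list lost their memory state
        # in the failure, so abort them too.
        for txn in active:
            undo(txn, log, cache)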

As long as the restart algorithm is running, users are unable to run transactions. It is therefore important to optimize the algorithm to minimize its running time and thereby maximize the system’s availability. These optimizations are the subject of the next section.