[Bernstein09] Chapter 7. System Recovery

Chapter 7. System Recovery
Causes of System Failure
A Model for System Recovery
Introduction to Database Recovery
The System Model
Database Recovery Manager
Shadow-paging Algorithm
Log-based Database Recovery Algorithms
Optimizing Restart in Log-based Algorithms
Media Recovery
Summary
7.1. Causes of System Failure
A critical requirement for most TP systems is that they be up all the time; in other words, highly available. Such systems often are called “24 by 7” (or 24 × 7), since they are intended to run 24 hours per day, 7 days per week. Defining this concept more carefully, we say that a system is available if it is running correctly and yielding the expected results. The availability of a system is defined as the fraction of time that the system is available. Thus, a highly available system is one that, most of the time, is running correctly and yielding expected results.
Availability is reduced by two factors. One is the rate at which the system fails. By fails, we mean the system gives the wrong answer or no answer. Other things being equal, if it fails frequently, it is less available. The second factor is recovery time. Other things being equal, the longer it takes to fix the system after it fails, the less available it is. These concepts are captured in two technical terms: mean time between failures and mean time to repair. The mean time between failures, or MTBF, is the average time the system runs before it fails. MTBF is a measure of system reliability. The mean time to repair, or MTTR, is how long it takes to fix the system after it does fail. Using these two measures, we can define availability precisely as MTBF/(MTBF + MTTR), which is the fraction of time the system is running. Thus, availability improves when reliability (MTBF) increases and when repair time (MTTR) decreases.
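To make the formula concrete, here is a minimal sketch in Python; the MTBF and MTTR values are illustrative assumptions, not figures from the text.

    # Availability as defined above: the fraction of time the system is running.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Assumed numbers for illustration: a system that runs 1000 hours between
    # failures and takes 1 hour to repair is about 99.9% available.
    print(f"{availability(1000.0, 1.0):.4%}")   # 99.9001%
    # Halving the repair time improves availability without changing reliability.
    print(f"{availability(1000.0, 0.5):.4%}")   # 99.9500%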
In many practical settings, the system is designed to meet a service level agreement (SLA), which is typically a combination of availability, response time, and throughput. That is, it is not enough that the system is available. It must also have satisfactory performance. Of course, poor performance may arise from many sources, such as the database system, network, or operating system. Performance problems are sometimes TP-specific, such as the cases of locking performance discussed in Chapter 6. More often, they are specific to other component technologies. These problems are important, but since they are not specific to the TP aspects of the system, we will not consider them here. Instead, we focus entirely on failures and how to recover from them.
Failures come from a variety of sources. We can categorize them as follows:
The environment: Effects on the physical environment that surrounds the computer system, such as power, communication, air conditioning, fire, and flood.
System management: What people do to manage the system, including vendors doing preventative maintenance and system operators taking care of the system.
Hardware: All hardware devices including processors, memory, I/O controllers, storage devices, etc.
Software: The operating system, communication systems, database systems, transactional middleware, other system software, and application software.
Let’s look at each category of failures and see how we can reduce their frequency.
Hardening the Environment
One part of the environment is communications systems that are not under the control of the people building the computer system, such as long distance communication provided by a telecommunications company. As a customer of communication services, sometimes one can improve communications reliability by paying more to buy more reliable lines. Otherwise, about all one can do is lease more communication lines than are needed to meet functional and performance goals. For example, if one communication line is needed, lease two independent lines instead, so if one fails, the other one will probably still be operating.
A second aspect of the environment is power. Given its failure rate, it’s often appropriate to have battery backup for the computer system. In the event of power failure, battery backup can at least keep main memory alive, so the system can restart immediately after power is restored without rebooting the operating system, thereby reducing MTTR. Batteries may be able to run the system for a short period, either to provide useful service (thereby increasing MTBF) or to hibernate the system by saving main memory to a persistent storage device (which can improve availability if recovering from hibernation is faster than rebooting). To keep running during longer outages, an uninterruptible power supply (UPS) is needed. A full UPS generally includes a gas or diesel powered generator, which can run the system much longer than batteries. Batteries are still used to keep the system running for a few minutes until the generator can take over.
A third environmental issue is air conditioning. An air conditioning failure can bring down the computer system, so when a computer system requires an air conditioned environment, a redundant air conditioning system is often advisable.
Systems can fail due to natural disasters, such as fire, flood, and earthquake, or due to other extraordinary external events, such as war and vandalism. There are things one can do to defend against some of these events: build buildings that are less susceptible to fire, that are able to withstand strong earthquakes, and that are secured against unauthorized entry. How far one goes depends on the cost of the defense, the benefit to availability, and the cost of downtime to the enterprise. When the system is truly “mission critical,” as in certain military, financial, and transportation applications, an enterprise will go to extraordinary lengths to reduce the probability of such failures. One airline system is housed in an underground bunker.
After hardening the environment, the next step is to replicate the system, ideally in a geographically distant location whose environmental disasters are unlikely to be correlated to those at other replicas. For example, many years ago one California bank built an extra computer facility east of the San Andreas Fault, so they could still operate if their Los Angeles or San Francisco facility were destroyed by an earthquake. More recently, geographical replication has become common practice for large-scale Internet sites. Since a system replica is useful only if it has the data necessary to take over processing for a failed system, data replication is an important enabling technology. Data replication is the subject of Chapter 9.
System Management
System management is another cause of failures. People are part of the system. Everybody has an off day or an occasional lapse of attention. It’s only a matter of time before even the best system operator does something that causes the system to fail.
There are several ways to mitigate the problem. One is simply to design the system so that it doesn’t require maintenance, such as using automated procedures for functions that normally would require operator intervention. Even preventative maintenance, which is done to increase availability by avoiding failures later on, may be a source of downtime. Such procedures should be designed to be done while the system is operating.
Simplifying maintenance procedures also helps, if maintenance can’t be eliminated entirely. So does building redundancy into maintenance procedures, so an operator has to make at least two mistakes to cause the system to malfunction. Training is another factor. This is especially important for maintenance procedures that are needed infrequently. It’s like having a fire drill, where people train for rare events, so when the events do happen, people know what actions to take.
Software installation is often a source of planned failures. The installation of many software products requires rebooting the operating system. Developing installation procedures that don’t require rebooting is a way to improve system reliability.
Many operation errors involve reconfiguring the system. Sometimes adding new machines to a rack or changing the tuning parameters on a database system causes the system to malfunction. Even if it only degrades performance, rather than causing the system to crash, the effect may be the same from the end user’s perspective. One can avoid unpleasant surprises by using configuration management tools that simulate a new configuration and demonstrate that it will behave as predicted, or by running test procedures on a test system that show that a changed configuration will perform as predicted. Moreover, it is valuable to have reconfiguration procedures that can be quickly undone, so that when a mistake is made, one can revert to the previous working configuration quickly, as sketched below.
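The following is a minimal sketch of such a quickly reversible reconfiguration step, assuming a hypothetical JSON configuration file; the file names and the tuning parameter are illustrative, not from the text.

    import json
    import shutil

    CONFIG_PATH = "system_config.json"        # hypothetical configuration file
    BACKUP_PATH = "system_config.json.bak"

    def apply_config(new_settings):
        """Snapshot the working configuration, then write the new one."""
        shutil.copyfile(CONFIG_PATH, BACKUP_PATH)
        with open(CONFIG_PATH, "w") as f:
            json.dump(new_settings, f, indent=2)

    def revert_config():
        """Restore the last known-good configuration in one step."""
        shutil.copyfile(BACKUP_PATH, CONFIG_PATH)

    # Usage: start from a known-good configuration, try a change, undo it if needed.
    with open(CONFIG_PATH, "w") as f:
        json.dump({"max_connections": 100}, f)
    apply_config({"max_connections": 500})    # the change under test
    revert_config()                           # quick revert to the working state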
If a system is not required to be 24 × 7, then scheduled downtime can be used to handle many of these problems, such as preventative maintenance, installing software that requires a reboot, or reconfiguring a system. However, from a vendor’s viewpoint, offering products that require such scheduled downtime limits their market only to customers that don’t need 24 × 7.
Hardware
The third cause of failures is hardware problems. To discuss hardware failures precisely, we need a few technical terms. A fault is an event inside the system that is believed to have caused a failure. A fault can be either transient or permanent. A transient fault is one that does not reoccur if you retry the operation. A permanent fault is not transient; it is repeatable.
The vast majority of hardware faults are transient. If the hardware fails, simply retry the operation; there’s a very good chance it will succeed. For this reason, operating systems have many built-in recovery procedures to handle transient hardware faults. For example, if the operating system issues an I/O operation to a disk or a communications device and gets an error signal back, it normally retries that operation many times before it actually reports an error back to the caller.
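The retry logic sketched below illustrates this pattern; the device interface, retry count, and backoff are illustrative assumptions, not the behavior of any particular operating system.

    import time

    def read_block_with_retry(device, block_number, max_attempts=5):
        """Retry a possibly transient I/O fault before reporting an error to the caller."""
        for attempt in range(1, max_attempts + 1):
            try:
                return device.read(block_number)   # hypothetical device interface
            except IOError:
                if attempt == max_attempts:
                    raise                          # fault persists: report it to the caller
                time.sleep(0.01 * attempt)         # brief pause before retrying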
Of course, some hardware faults are permanent. The most serious ones cause the operating system to fail, making the whole system unavailable. In this case, rebooting the operating system may get the system back into a working state. The reboot procedure will detect malfunctioning hardware and try to reconfigure around it. If the reboot fails or the system fails shortly after reboot, then the next step is usually to reimage the disk with a fresh copy of the software, in case it became corrupted. If that doesn’t fix the problem, then repairing the hardware is usually the only option.
Software
This brings us to software failures. The most serious type of software failure is an operating system crash, since it stops the entire computer system. Since many software problems are transient, a reboot often repairs the problem. This involves rebooting the operating system, running software that repairs disk state that might have become inconsistent due to the failure, recovering communications sessions with other systems in a distributed system, and restarting all the application programs. These steps all increase the MTTR and therefore reduce availability, so they should be made as fast as possible. The requirement for faster recovery inspired operating system vendors in the 1990s to incorporate fast file system recovery procedures, since file system recovery was a major component of operating system boot time. Some operating systems are carefully engineered for fast boot. For example, highly available communication systems have operating systems that reboot in under a minute, worst case. Taking this goal to the extreme, if the repair time were zero, then failures wouldn’t matter, since the system would recover instantaneously, and the user would never know the difference. Clearly, reducing the repair time can have a big impact on availability.
Some software failures only degrade a system’s capabilities rather than cause it to fail. For example, consider an application that offers functions that require access to a remote service. When the remote service is unavailable, those functions stop working. However, through careful application design, other application functions can still be operational. That is, the system degrades gracefully when parts of it stop working. A real example we know of is an application that used a TP database and a data warehouse, where the latter was nice to have but not mission-critical. The application was not designed to degrade gracefully, so when the data warehouse failed, the entire application became unavailable, which caused a large and unnecessary loss of revenue.
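A minimal sketch of this kind of graceful degradation, assuming a hypothetical application function that treats the data warehouse as optional (the names and the fallback value are illustrative, not from the text):

    def query_data_warehouse(customer_id):
        """Hypothetical call to the remote, non-critical data warehouse."""
        raise ConnectionError("data warehouse unavailable")   # simulate the outage

    def get_customer_history(customer_id):
        """Core function keeps working even when the warehouse is down."""
        try:
            return query_data_warehouse(customer_id)
        except ConnectionError:
            # Degrade gracefully: return a reduced answer instead of failing outright.
            return {"customer_id": customer_id, "history": None,
                    "note": "history temporarily unavailable"}

    print(get_customer_history("C-42"))   # still answers despite the outage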
When an application process or database system does fail, the failure must be detected and the application or database system process must be recovered. This is where TP-specific techniques become relevant.