Reliability and Availability Basics

Real-time and embedded systems are now a central part of our lives. Reliable functioning of these systems is of paramount concern to the millions of users that depend on them every day. Unfortunately, most embedded systems still fall short of users' expectations of reliability.
In this article we discuss basic techniques for measuring and improving the reliability of computer systems. The following topics are covered:
Failure Characteristics
  Hardware Failures
  Software Failures
Reliability Parameters
  MTBF
  FITS
  MTTR
  Availability
  Downtime
Hardware Failures
Hardware failures are typically characterized by a bathtub curve; an example curve is shown below. The chance of a hardware failure is high during the initial life of the module. The failure rate during the rated useful life of the product is fairly low. Once the end of life is reached, the failure rate of modules increases again.

[Figure: bathtub curve of hardware failure rate over the product's life]
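To make the shape of the curve concrete, the sketch below models the bathtub curve as a piecewise failure rate. The regions, rates, and time boundaries are purely illustrative assumptions, not data from any particular product.

    def bathtub_failure_rate(age_hours):
        """Illustrative piecewise bathtub curve: failure rate in failures
        per million hours as a function of module age (assumed values)."""
        INFANT_END = 1_000        # assumed end of infant-mortality period
        WEAR_OUT_START = 80_000   # assumed start of wear-out period
        if age_hours < INFANT_END:
            # Infant mortality: rate decays from a high initial value
            return 50.0 - 45.0 * (age_hours / INFANT_END)
        elif age_hours < WEAR_OUT_START:
            # Useful life: low, roughly constant random-failure rate
            return 5.0
        else:
            # Wear out: rate grows as components degrade
            return 5.0 + 0.001 * (age_hours - WEAR_OUT_START)

    for age in (100, 10_000, 100_000):
        print(f"{age:>7} h: {bathtub_failure_rate(age):.1f} failures per million hours")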
Hardware failures during a product's life can be attributed to the following causes:
Design failures: This class of failures takes place due to inherent design flaws in the system. In a well designed system this class of failures should make a very small contribution to the total number of failures.
Infant Mortality: This class of failures causes newly manufactured hardware to fail. Such failures can be attributed to manufacturing problems like poor soldering, leaking capacitors, etc. These failures should not be present in systems leaving the factory, as these faults will show up in factory burn-in tests.
Random Failures: Random failures can occur during the entire life of a hardware module. These failures can lead to system failures. Redundancy is provided to recover from this class of failures.
Wear Out: Once a hardware module has reached the end of its useful life, degradation of component characteristics will cause it to fail. This type of fault can be weeded out by preventive maintenance and by retiring hardware before the end of its useful life.
Software Failures
Software failures can be characterized by keeping track of the software defect density in the system. This number can be obtained from historical software defect data. Defect density depends on the following factors:
Software process used to develop the design and code (use of peer level design/code reviews, unit testing)
Complexity of the software
Size of the software
Experience of the team developing the software
Percentage of code reused from a previous stable project
Rigor and depth of testing before the product is shipped
Defect density is typically measured in number of defects per thousand lines of code (defects/KLOC).
MTBF
Mean Time Between Failures (MTBF), as the name suggests, is the average time between failures of a hardware module. It is the average time a manufacturer estimates will elapse before a failure occurs in the module.
MTBF for off-the-shelf hardware modules can be obtained from the vendor. MTBF for in-house developed hardware modules is calculated by the hardware team developing the board.
MTBF for software can be estimated from the defect density: multiplying the defect density (defects/KLOC) by the rate at which code is executed (KLOC per second) gives the expected software failure rate, and the MTBF is the reciprocal of that rate.
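As a rough illustration of that estimate (the defect density and execution rate below are assumed example numbers, not measurements from any system):

    def software_mtbf_hours(defects_per_kloc, kloc_executed_per_sec):
        """Estimate software MTBF from defect density and execution rate.

        Assumes every residual defect that is executed leads to a failure,
        which is a simplifying (pessimistic) assumption.
        """
        failures_per_sec = defects_per_kloc * kloc_executed_per_sec
        mtbf_seconds = 1.0 / failures_per_sec
        return mtbf_seconds / 3600.0  # convert seconds to hours

    # Example: 0.00001 failure-causing defects/KLOC, 1 KLOC executed per second
    print(f"Estimated software MTBF: {software_mtbf_hours(0.00001, 1):.1f} hours")  # ~27.8 hours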
FITS
FITS is a more intuitive way of representing MTBF. FITS is simply the total number of failures of the module in a billion hours of operation (i.e. 1,000,000,000 hours).
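The conversion between the two representations is direct; the sketch below assumes the MTBF is expressed in hours:

    BILLION_HOURS = 1_000_000_000  # FITS is defined against a billion hours

    def mtbf_to_fits(mtbf_hours):
        """Number of failures expected in a billion hours of operation."""
        return BILLION_HOURS / mtbf_hours

    def fits_to_mtbf(fits):
        """Average hours between failures for a given FITS value."""
        return BILLION_HOURS / fits

    # Example: a module with an MTBF of 500,000 hours is a 2000 FITS part
    print(mtbf_to_fits(500_000))   # 2000.0
    print(fits_to_mtbf(2000))      # 500000.0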
MTTR
Mean Time To Repair (MTTR) is the time taken to repair a failed hardware module. In an operational system, repair generally means replacing the hardware module, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. It should be a goal of system designers to allow for a high MTTR value and still achieve the system reliability goals; as the table below shows, a low MTTR requirement means a high operational cost for the system.
Estimating the Hardware MTTR
Where are hardware spares kept? | How is the site manned? | Estimated MTTR
Onsite | 24 hours a day | 30 minutes
Onsite | Operator is on call 24 hours a day | 2 hours
Onsite | Regular working hours on week days as well as weekends and holidays | 14 hours
Onsite | Regular working hours on week days only | 3 days
Offsite; shipped by courier when a fault condition is encountered | Operator is paged by the system when a fault is detected | 1 week
Offsite; maintained in an operator controlled warehouse | System is remotely located; an operator needs to be flown in to replace the hardware | 2 weeks
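One way to act on the "allow a high MTTR" goal is to work backwards from an availability target. Using the relation Availability = MTBF / (MTBF + MTTR), discussed later in this article, the sketch below computes the largest MTTR that still meets a target; the MTBF and target are assumed example values.

    def max_mttr_hours(mtbf_hours, availability_target):
        """Largest MTTR (hours) that still meets the availability target,
        derived from Availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours * (1.0 - availability_target) / availability_target

    # Example: a module with a 100,000 hour MTBF and a 99.999% availability target
    print(f"{max_mttr_hours(100_000, 0.99999):.2f} hours")  # 1.00 hours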
MTTR for a software module can be computed as the time taken to reboot after a software fault is detected. Thus software MTTR can be viewed as the mean time to reboot after a software fault has been detected. The goal of system designers should be to keep the software MTTR as low as possible. MTTR for software depends on several factors:
Software fault tolerance techniques used
OS selected (does the OS allow independent application reboot?)
Code image downloading techniques
Estimating Software MTTR
Software fault recovery mechanism | Software reboot mechanism on fault detection | Estimated MTTR
Software failure is detected by watchdog and/or health messages | Processor automatically reboots from a ROM resident image | 30 seconds
Software failure is detected by watchdog and/or health messages | Processor automatically restarts the offending tasks, without needing an operating system reboot | 30 seconds
Software failure is detected by watchdog and/or health messages | Processor automatically reboots, the operating system reboots from the disk image and restarts applications | 3 minutes
Software failure is detected by watchdog and/or health messages | Processor automatically reboots; the operating system and application images have to be downloaded from another machine | 10 minutes
Software failure detection is not supported | Manual operator reboot is required | 30 minutes to 2 weeks (software MTTR is the same as the hardware MTTR)
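The first four rows above assume some form of watchdog or health-message based detection. As a very rough illustration of the "restart the offending task" approach, here is a minimal supervisor sketch; the command, check interval, and restart policy are assumptions for the sketch, not a prescription.

    import subprocess
    import time

    # Hypothetical application command; replace with the real task to supervise.
    APP_CMD = ["/usr/local/bin/example_app"]
    CHECK_INTERVAL_SEC = 1.0  # assumed health-check period

    def supervise():
        """Minimal watchdog: restart the task whenever it is found dead.

        The time from fault to restart (roughly CHECK_INTERVAL_SEC plus the
        task start-up time) is this mechanism's contribution to software MTTR.
        """
        proc = subprocess.Popen(APP_CMD)
        while True:
            time.sleep(CHECK_INTERVAL_SEC)
            if proc.poll() is not None:          # task has exited or crashed
                print("task died, restarting")   # in a real system: log / raise alarm
                proc = subprocess.Popen(APP_CMD)

    if __name__ == "__main__":
        supervise()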
Availability
Availability of a module is the percentage of time for which the module is operational. Availability of a hardware/software module can be obtained by the formula given below:

Availability = MTBF / (MTBF + MTTR)
Availability is typically specified in "nines" notation. For example, 3-nines availability corresponds to 99.9% availability and 5-nines availability corresponds to 99.999% availability.
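Putting MTBF, MTTR and availability together (the MTBF and MTTR figures below are assumed example values):

    def availability(mtbf_hours, mttr_hours):
        """Fraction of time the module is operational."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Example: 100,000 hour MTBF with a 2 hour MTTR
    a = availability(100_000, 2)
    print(f"Availability: {a:.6%}")   # 99.998000%, i.e. a little short of 5-nines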
Downtime
Downtime per year is a more intuitive way of understanding availability. The table below compares availability levels and the corresponding downtime.
Availability | Downtime
90% (1-nine) | 36.5 days/year
99% (2-nines) | 3.65 days/year
99.9% (3-nines) | 8.76 hours/year
99.99% (4-nines) | 52 minutes/year
99.999% (5-nines) | 5 minutes/year
99.9999% (6-nines) | 31 seconds/year!
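The downtime figures follow directly from the availability percentage; a quick way to reproduce the table (values are rounded):

    SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years for this estimate

    def downtime_seconds_per_year(availability_percent):
        """Expected yearly downtime for a given availability percentage."""
        return SECONDS_PER_YEAR * (1.0 - availability_percent / 100.0)

    for a in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
        secs = downtime_seconds_per_year(a)
        print(f"{a}% -> {secs / 3600:.2f} hours/year ({secs:.0f} seconds/year)")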