Hardware Fault Tolerance, Redundancy Schemes and Fault Handling

Hardware Fault Tolerance
Most Realtime systems must function with very high availability even under hardware fault conditions. This article covers several techniques that are used to minimize the impact of hardware faults.
Redundancy Schemes
Realtime systems are equipped with redundant hardware modules. Whenever a fault is encountered, the redundant modules take over the functions of the failed hardware module. Hardware redundancy may be provided in one of the following ways:
One for One Redundancy
N + X Redundancy
Load Sharing
One for One Redundancy
Here, each hardware module has a redundant hardware module. The hardware module that performs the functions under normal conditions is called Active and the redundant unit is called Standby. The standby keeps monitoring the active unit at all times. It will take over and become active if the active unit fails. Since the standby has to take over under fault conditions, it has to keep itself synchronized with the active unit's operations.
Since the probability of both units failing at the same time is very low, this technique provides the highest level of availability. The main disadvantage here is that it doubles the hardware cost.
The CAS redundancy scheme in the Xenon Switching System is a good example of one for one redundancy.
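The active/standby relationship described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit names, the heartbeat mechanism and the timeout value are hypothetical and are not taken from the Xenon design.

```python
class RedundantPair:
    """One for one pair: the standby takes over when heartbeats stop."""

    def __init__(self, timeout):
        self.active = "unit-A"        # hypothetical unit names
        self.standby = "unit-B"
        self.timeout = timeout        # seconds of silence tolerated (assumed)
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Called whenever the active unit reports that it is alive."""
        self.last_heartbeat = now

    def monitor(self, now):
        """Run periodically by the standby; returns True if it took over."""
        silent_for = now - self.last_heartbeat
        if silent_for > self.timeout and self.standby is not None:
            # The failed unit leaves service, so no standby remains.
            self.active, self.standby = self.standby, None
            self.last_heartbeat = now
            return True
        return False


pair = RedundantPair(timeout=3.0)
pair.heartbeat(now=1.0)
print(pair.monitor(now=2.0))    # heartbeat is recent, so no takeover
print(pair.monitor(now=10.0))   # silence exceeded the timeout
print(pair.active)
```

Time is passed in explicitly so that the sketch stays deterministic; a real system would read a clock and would also have to synchronize state, which is covered later in this article.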
N + X Redundancy
In this scheme, if N hardware modules are required to perform system functions, the system is configured with N + X hardware modules; typically X is much smaller than N. Whenever any of the N modules fails, one of the X modules takes over its functions. Since it is not practical for the X units to monitor the health of all N units at all times, a higher level module monitors the health of the N units. If one of the N units fails, it selects one of the X units to take over. (It may be noted that one for one is a special case of N + X.)
The advantage lies in the reduced hardware cost of the system, as only X units are required to back up N units. However, in case of multiple failures (more than X units down at once), some functions are left without a backup, so this scheme provides lower system availability.
The XEN redundancy scheme in the Xenon Switching System is a good example of N + X redundancy.
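As a rough sketch of the higher level module's job, the following Python class keeps a pool of X spares and hands one out when a working unit fails. The class name, unit names and role names are assumptions made for this example only.

```python
class NPlusXManager:
    """Higher level module that assigns spare units to failed roles."""

    def __init__(self, n, x):
        # unit name -> role it currently performs (names are hypothetical)
        self.working = {f"unit-{i}": f"role-{i}" for i in range(n)}
        self.spares = [f"spare-{i}" for i in range(x)]

    def report_failure(self, unit):
        """Replace a failed unit; returns the spare used, or None."""
        role = self.working.pop(unit)
        if not self.spares:
            return None               # more than X failures: role is unserved
        spare = self.spares.pop(0)
        self.working[spare] = role
        return spare


manager = NPlusXManager(n=3, x=1)
print(manager.report_failure("unit-0"))   # the only spare takes over role-0
print(manager.report_failure("unit-1"))   # no spare is left for role-1
```

The second call shows the weakness mentioned above: once the X spares are used up, a further failure leaves a function without any unit to perform it.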
Load Sharing
In this scheme, under zero fault conditions, all the hardware modules that are equipped to perform system functions share the load. A higher level module performs the load distribution. It also maintains the health status of the hardware units. If one of the load sharing modules fails, the higher level module starts distributing the load among the rest of the units. There is graceful degradation of performance with hardware failure.
Here, there is almost no extra hardware cost to provide the redundancy. The main disadvantage is that if a hardware failure happens during the busy hour, the system will perform at a sub-optimal level until the failed module is replaced.
In the WebTaxi design, the TaxiSession processor uses load sharing to distribute the taxi session load.
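A minimal sketch of such a higher level distributor is shown below. It assumes a simple round robin policy and hypothetical module names; the WebTaxi design may well distribute load differently.

```python
class LoadDistributor:
    """Higher level module: tracks health and spreads work round robin."""

    def __init__(self, modules):
        self.healthy = list(modules)
        self._next = 0                # running counter for round robin

    def mark_failed(self, module):
        """Remove a failed module so it receives no further work."""
        self.healthy.remove(module)

    def assign(self):
        """Return the module that should handle the next unit of work."""
        if not self.healthy:
            raise RuntimeError("no healthy modules left")
        module = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return module


distributor = LoadDistributor(["m0", "m1", "m2"])
print([distributor.assign() for _ in range(3)])
distributor.mark_failed("m1")
print([distributor.assign() for _ in range(4)])   # only m0 and m2 remain
```

After the failure the same load is spread over fewer modules, which is the graceful degradation described above.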
Network Load Balancing
Network load balancing is a different flavor of load sharing where there is no higher level processor to perform load distribution. Instead, the load distribution is achieved by hashing on the source address bits. For example, many high traffic websites perform load sharing by broadcasting the HTTP GET request over the Ethernet to all the load sharing machines. The network cards on the load sharing machines are configured to pass a certain portion of the HTTP GET requests to the main computer. The remaining requests are filtered out, as they will be handled by other machines. If one of the load sharing machines fails, the filter settings on all the active machines are modified to redistribute the traffic.
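The filtering rule can be illustrated with a small Python sketch. The hash used here (the numeric source address modulo the number of active machines) and the machine names are assumptions chosen for simplicity; real network cards or drivers may use a different function of the address bits.

```python
import ipaddress


def owner(source_ip, active_machines):
    """Pick the one machine that should accept requests from source_ip."""
    address_bits = int(ipaddress.ip_address(source_ip))
    return active_machines[address_bits % len(active_machines)]


def accepts(machine, source_ip, active_machines):
    """Filter applied independently on every machine that sees the request."""
    return owner(source_ip, active_machines) == machine


machines = ["web-0", "web-1", "web-2"]     # hypothetical machine names
print([m for m in machines if accepts(m, "10.0.0.7", machines)])

# After a failure, every surviving machine switches to the shorter list.
survivors = ["web-0", "web-2"]
print([m for m in survivors if accepts(m, "10.0.0.7", survivors)])
```

Because every machine evaluates the same function on the same list of active machines, exactly one of them accepts each request without any central distributor.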
Standby Synchronization
For redundancy to work, the standby unit needs to be kept synchronized with the active unit at all times. This is required so that the standby can step into the active's shoes in case the active fails. Standby synchronization can be achieved in the following ways:
Bus Cycle Level Synchronization
Memory Mirroring
Message Level Synchronization
Checkpoint Level Synchronization
Reconciliation on Takeover
Bus Cycle Level Synchronization
In this scheme the active and the standby are locked at the processor bus cycle level. To keep itself synchronized with the active unit, the standby unit watches each processor instruction that is performed by the active. Then, it performs the same instruction in the next bus cycle and compares the output with that of the active unit. If the output does not match, the standby might take over and become active. The main disadvantage here is that specialized hardware is needed to implement this scheme. Also, bus cycle level synchronization introduces wait states in bus cycle execution. This will lower the overall performance of the processor.
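Although this scheme is implemented in hardware, the compare step can be modeled in software. The sketch below is a toy model only: the two "ALU" functions and the injected fault are hypothetical stand-ins for the active and standby processors.

```python
def run_lockstep(instructions, active_unit, standby_unit):
    """Return the index of the first mismatching instruction, or None."""
    for index, (op, a, b) in enumerate(instructions):
        # The standby repeats the instruction and compares the outputs.
        if active_unit(op, a, b) != standby_unit(op, a, b):
            return index
    return None


def good_alu(op, a, b):
    """Fault-free unit supporting two toy operations."""
    return a + b if op == "add" else a - b


def faulty_alu(op, a, b):
    """Hypothetical fault: the low bit is stuck at 1 for subtraction."""
    result = good_alu(op, a, b)
    return result | 1 if op == "sub" else result


program = [("add", 1, 2), ("sub", 6, 2), ("add", 3, 3)]
print(run_lockstep(program, good_alu, good_alu))
print(run_lockstep(program, faulty_alu, good_alu))
```

A mismatch only tells the pair that the two units disagree; deciding which unit is at fault needs additional information, which is one reason the text says the standby "might" take over.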
Memory Mirroring
Here, the system is configured with two CPUs and two parity based memory cards. One of the CPUs is active and the other is standby. Both the memory cards are driven by the active CPU. No memory is attached to the standby unit. Each memory write by the active is made to both the memory cards. The data bits and the parity bits are updated individually on both the memory cards. On every memory read, the output of both the memory cards is compared. If a mismatch is detected, the processor believes the memory card with the correct parity bit. The other memory card is marked suspected and a fault trigger is generated.
The standby unit continuously monitors the health of the active unit by a sanity punching or watchdog mechanism. If a fault is detected, the standby takes over both the memory cards. Since the application context is kept in memory, the new active processor gets the application context.
The main disadvantage here is that specialized hardware is needed to implement this scheme. Also, memory mirroring introduces wait states in bus cycle execution. This will lower the overall performance of the processor.
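The write and read rules for the mirrored memory cards can be modeled as follows. This is a software sketch of behavior that would really be done in hardware; the class name, the use of one parity bit per word and the dictionaries standing in for memory cards are all assumptions.

```python
def parity(value):
    """Parity bit of an integer word: 1 if the number of set bits is odd."""
    return bin(value).count("1") % 2


class MirroredMemory:
    """Two memory cards written together and compared on every read."""

    def __init__(self):
        self.cards = [{}, {}]     # address -> (data, parity bit)
        self.suspected = None     # index of the card marked as suspect

    def write(self, address, data):
        for card in self.cards:
            card[address] = (data, parity(data))

    def read(self, address):
        first, second = self.cards[0][address], self.cards[1][address]
        if first == second:
            return first[0]
        # Mismatch: believe the card whose parity bit matches its data.
        # (The case where both cards fail parity is not handled here.)
        good = 0 if parity(first[0]) == first[1] else 1
        self.suspected = 1 - good
        return self.cards[good][address][0]


memory = MirroredMemory()
memory.write(0x10, 0b1011)
memory.cards[1][0x10] = (0b1010, 1)   # simulate a single bit flip on card 1
print(bin(memory.read(0x10)), memory.suspected)
```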
Message Level Synchronization
In this scheme, the active unit passes all the messages received from external sources to the standby. The standby performs all the actions as though it were active, with the difference that no output is sent to the external world. The main advantage here is that no special hardware is required to implement this. The scheme is practical only in conditions where the processor is required to take fairly simple decisions. In cases of complex decisions, the synchronization can be easily lost if the two processors take different decisions on the same input message.
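A minimal sketch of this idea follows. The toy application (counting messages per subscriber and acknowledging each one) and all names in it are hypothetical; the point is only that both units run the same handler while the standby suppresses its output.

```python
class Unit:
    """Toy application that counts messages per subscriber."""

    def __init__(self, is_active):
        self.is_active = is_active
        self.state = {}           # subscriber -> number of messages seen
        self.sent = []            # output delivered to the external world

    def handle(self, message):
        subscriber = message["subscriber"]
        self.state[subscriber] = self.state.get(subscriber, 0) + 1
        if self.is_active:        # the standby computes but stays silent
            self.sent.append(f"ack {subscriber}")


def receive(message, active, standby):
    """The active processes an external message and forwards a copy."""
    active.handle(message)
    standby.handle(message)


active, standby = Unit(is_active=True), Unit(is_active=False)
for name in ["alice", "bob", "alice"]:
    receive({"subscriber": name}, active, standby)

print(active.state == standby.state)   # both units hold the same state
print(standby.sent)                    # nothing was sent by the standby
```

The handler here is deterministic, so the two units cannot diverge. Anything that depends on timing, random numbers or local resource allocation would break that property, which is the loss of synchronization the text warns about.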
Checkpoint Level Synchronization
To some extent, this one is like message level synchronization, as the active conveys synchronization information in terms of messages to the standby. The difference is that not all the external world messages are conveyed. The information is conveyed only about predefined milestones. For example, in a Call Processing system, checkpoints may be passed only when the call reaches conversation or is cleared. If the standby takes over, all the calls in conversation would be retained whereas all calls in transient states will be lost. Resource information for the transient calls may be retrieved by running software audits with other modules. This scheme is not prone to loss of synchronization under normal conditions. Also, the message traffic to the standby is reduced, thus improving the overall performance of the active.
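The call processing example can be sketched as below. The state names other than conversation and cleared, and the class and method names, are assumptions made for illustration.

```python
STABLE_STATES = {"conversation", "cleared"}   # milestones that are checkpointed


class StandbyCallProcessor:
    def __init__(self):
        self.calls = {}                # call id -> last checkpointed state
        self.checkpoints_received = 0

    def checkpoint(self, call_id, state):
        self.checkpoints_received += 1
        if state == "cleared":
            self.calls.pop(call_id, None)
        else:
            self.calls[call_id] = state


class ActiveCallProcessor:
    def __init__(self, standby):
        self.calls = {}
        self.standby = standby

    def set_state(self, call_id, state):
        self.calls[call_id] = state
        if state in STABLE_STATES:     # transient states are not conveyed
            self.standby.checkpoint(call_id, state)


standby = StandbyCallProcessor()
active = ActiveCallProcessor(standby)
for state in ["dialing", "ringing", "conversation"]:   # hypothetical states
    active.set_state(1, state)
active.set_state(2, "dialing")         # call 2 is still transient

print(standby.calls)                   # only call 1 would survive a takeover
print(standby.checkpoints_received)
```

Four state changes on the active produced a single message to the standby, which is where the reduction in synchronization traffic comes from.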
Reconciliation on Takeover
In this scheme, there is no synchronization between the active and the standby. When the standby takes over, it recovers the processor context by requesting information from other modules in the system. The advantage of this scheme lies in its simplicity of implementation. Also, there is no performance overhead due to mate synchronization. The disadvantage is that the standby takeover may be delayed due to reconciliation requirements.
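As an illustration, the takeover step might look like the following sketch, in which the new active rebuilds its view of ongoing calls by querying its peers. The peer modules are modeled as plain functions returning the resources they hold per call; that query interface and the resource names are hypothetical.

```python
def reconcile(peers):
    """Rebuild the call context by asking every peer what it holds."""
    context = {}
    for query_peer in peers:
        for call_id, resource in query_peer().items():
            context.setdefault(call_id, []).append(resource)
    return context


def trunk_module():
    """Hypothetical peer: trunks currently seized, keyed by call id."""
    return {1: "trunk-7", 2: "trunk-9"}


def tone_module():
    """Hypothetical peer: tone generators in use, keyed by call id."""
    return {2: "tone-3"}


print(reconcile([trunk_module, tone_module]))
```

Every peer has to be queried before the new active can resume normal operation, so the takeover delay grows with the number of modules and the amount of state they hold.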