Random faults are much easier to tolerate than worst-case faults. In
fact, any network can be made to tolerate a large number of randomly
placed faults simply by making multiple copies of its switches and
edges. For example, suppose that an $n$-switch network $G$ is
replaced by a network $G'$ in which for each switch $u$ in $G$ there
are a pair of switches $u_1$ and $u_2$ in $G'$, and for each edge
$(u,v)$ in $G$ there are four edges $(u_1,v_1)$, $(u_1,v_2)$,
$(u_2,v_1)$, and $(u_2,v_2)$. Then $G'$ can simulate $G$ provided that
there is no pair of switches $u_1$ and $u_2$ in $G'$ that simultaneously
fail. If each switch fails independently with probability $p = \epsilon/\sqrt{n}$,
then the expected number of failures is $2\epsilon\sqrt{n}$,
and the probability that any particular pair of switches $u_1$ and $u_2$
both fail is $\epsilon^2/n$. Since there are $n$ pairs of switches, the
probability that any pair fails is at most $\epsilon^2$. This
technique is easily generalized. By making $c$ copies of each switch
and $c^2$ copies of each edge, any $n$-switch network can be made to
tolerate a failure rate of $\epsilon/n^{1/c}$ with probability $1 - \epsilon^c$.
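Spelled out, the probability estimate above is a short union bound; writing $F_u$ for the event that both copies $u_1$ and $u_2$ of a switch $u$ fail (this notation is ours), we have
$$\Pr[F_u] = p^2 = \frac{\epsilon^2}{n}, \qquad \Pr\Bigl[\,\bigcup_{u \in G} F_u\Bigr] \;\le\; \sum_{u \in G} \Pr[F_u] \;=\; n\cdot\frac{\epsilon^2}{n} \;=\; \epsilon^2 .$$
The generalized bound is obtained the same way: all $c$ copies of a fixed switch fail with probability $p^c$, so with $p = \epsilon/n^{1/c}$ the union bound over the $n$ switches gives failure probability at most $np^c = \epsilon^c$.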
By a similar argument, even if each switch in $G'$ fails with some
fixed constant probability, then, with high probability, $G'$ can
simulate a constant fraction of the switches in $G$. In general,
however, there is no guarantee that these switches in $G$ will be
connected in a useful fashion. As the following theorem shows,
however, even if the switches fail with some constant probability, an
$N$-input multibutterfly will have some set of $\Theta(N)$ inputs and
$\Theta(N)$ outputs between which any permutation can be routed in
$O(\log N)$ steps, with high probability. Furthermore, an $N$-input
multi-Benes network can tolerate constant failure probability and
still route permutations of $\Theta(N)$ packets in $O(\log N)$
time. The only other networks that are known to tolerate constant
failure probability are the $N$-switch hypercube, which can route any
permutation of $N$ packets in $O(\log N)$ time, with high probability,
but has degree $\log N$ [10], and the $N$-switch mesh, which can route
permutations of $\Theta(N)$ packets in $O(\sqrt{N})$ time [12].
The strategy for tolerating random faults is the same as the strategy
of Section 3.1 for tolerating worst-case faults.
We first examine each splitter to determine if more than an
$\epsilon_0$ fraction of the input switches in that splitter are faulty,
where $p$ denotes the failure probability of each switch, $\epsilon_0 > ep$,
and $\epsilon_0$ is small enough that the lemmas of
Section 3.1 apply. If so, then we erase the splitter and
all of its descendant switches and outputs. Then we propagate faults
back from level $\log N$ to level 0. A switch is declared to be
faulty if at least half of its upper output edges or at least half of
its lower output edges lead
to faulty switches that have not been erased. The following theorem
shows that, with high probability, this strategy leaves a constant
fraction of the network intact.
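In symbols (the notation here is ours, not the paper's): for a non-erased switch $v$ on level $l < \log N$, let $\mathrm{up}(v)$ and $\mathrm{low}(v)$ denote the sets of switches reached by its upper and lower output edges, and let $f(S)$ denote the number of switches in a set $S$ that are faulty and have not been erased. The backward propagation step, applied for $l = \log N - 1$ down to 0, declares
$$\mathrm{faulty}(v) \iff \mathrm{failed}(v) \;\text{ or }\; f(\mathrm{up}(v)) \ge \tfrac{1}{2}\,|\mathrm{up}(v)| \;\text{ or }\; f(\mathrm{low}(v)) \ge \tfrac{1}{2}\,|\mathrm{low}(v)|,$$
where on level $\log N$ only the switches that actually failed are considered faulty.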
Proof: The hard part of the proof is showing that we don't erase too many outputs.
We begin by deriving an upper bound on the probability that more than
$\epsilon_0 M$ input switches fail in an $M$-input splitter. Let $S$
denote the number of input switches that fail in an $M$-input
splitter. Then $S$ has a binomial distribution, and
$$\Pr[S \ge k] \;\le\; \binom{M}{k} p^{k} .$$
Setting $k = \epsilon_0 M$, we have
$\Pr[S \ge \epsilon_0 M] \le \binom{M}{\epsilon_0 M} p^{\epsilon_0 M}$. Using the inequality
$\binom{M}{\epsilon_0 M} \le (e/\epsilon_0)^{\epsilon_0 M}$ and letting
$b = \epsilon_0 \log_2(\epsilon_0/(ep))$, we have
$\Pr[S \ge \epsilon_0 M] \le 2^{-bM}$, where $b > 0$
for $\epsilon_0 > ep$.
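For completeness, the algebra behind the last step is
$$\binom{M}{\epsilon_0 M} p^{\epsilon_0 M} \;\le\; \Bigl(\frac{e}{\epsilon_0}\Bigr)^{\epsilon_0 M} p^{\epsilon_0 M} \;=\; \Bigl(\frac{ep}{\epsilon_0}\Bigr)^{\epsilon_0 M} \;=\; 2^{-\epsilon_0 M \log_2(\epsilon_0/(ep))} \;=\; 2^{-bM},$$
which is exponentially small in $M$ precisely because $\epsilon_0 > ep$ makes the logarithm positive.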
We can analyze the number of erased splitters on each level in a
similar fashion. Consider a level of the network that contains $N/M$
$M$-input splitters. Using the fact that each splitter on this level
is erased independently with probability at most $2^{-bM}$, we
can show that with probability at least $1 - O(1/N)$,
the number of erased splitters is at most $cN/M^2$ for any
constant $c > eM2^{-bM}$. For small $p$ and all
$M$, $c$ can be made arbitrarily close to 0. Hence, with
probability at least $1 - O(1/N)$, at most $cN/M$
outputs are erased due to faults occurring in splitters with $M$
inputs. Summing over $M = 2, 4, \ldots, N$, we find that with
probability at least $1 - O(\log N/N)$, at most $cN$ outputs are
erased overall. Hence, at least $\gamma N$ outputs remain, where
$\gamma = 1 - c$ can be made close to 1 by setting $c$ (equivalently, $p$) to be close to
zero.
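The sum over splitter sizes is geometric: with $M$ ranging over powers of two,
$$\sum_{M = 2, 4, \ldots, N} \frac{cN}{M} \;\le\; cN \sum_{j \ge 1} 2^{-j} \;=\; cN,$$
so the total number of erased outputs is dominated by the contribution of the smallest splitters.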
Once all of the blocks containing more than an $\epsilon_0$ fraction
of faulty switches are erased, we can apply
Lemmas 3.2 and 3.3 to show that
there aren't too many propagated faults on any level. In order to
apply Lemma 3.3, we must bound the number of switches
that fail on any level. To do this, we bound the binomial
distribution to show that at most $2pN$ switches fail on a level with
probability at least $1 - 2^{-dN}$, where $d > 0$ is a constant depending on $p$. By
Lemma 3.3, if there are at most $2pN$ switch
failures on any level, then there are at most $O(pN)$
propagated faults, and $O(pN)$
total faults on any level. Thus, with probability at least
$1 - O(\log N/N)$, some set of $\alpha N$ inputs and $\alpha N$
outputs remain, where $\alpha$ is a positive constant. (Notice that
$\alpha$ can be made arbitrarily close to 1,
since $c$ and $p$ can be made arbitrarily small.) The
time to route between these inputs and outputs is given by
Theorem 3.4.
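One standard way to obtain the per-level failure bound used above is a Chernoff estimate: if $X$ denotes the number of failures among the $N$ switches on a level, then
$$\Pr[X \ge 2pN] \;\le\; \Bigl(\frac{e}{4}\Bigr)^{pN} \;=\; 2^{-dN}, \qquad d = p\log_2\frac{4}{e} > 0 .$$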