next up previous
Next: 3.3 On-line reconfiguration Up: 3 Routing around faults Previous: 3.1 Worst case faults

3.2 Random faults

 

Random faults are much easier to tolerate than worst-case faults. In fact, any network can be made to tolerate a large number of randomly placed faults simply by making multiple copies of its switches and edges. For example, suppose that an n-switch network G is replaced by a network tex2html_wrap_inline2073 in which for each switch u in G there are a pair of switches u and tex2html_wrap_inline2081 in tex2html_wrap_inline2083 , and for each edge tex2html_wrap_inline2085 in G there are four edges tex2html_wrap_inline2089 , tex2html_wrap_inline2091 , tex2html_wrap_inline2093 , and tex2html_wrap_inline2095 . Then tex2html_wrap_inline2097 can simulate G provided that there is no pair of switches u and tex2html_wrap_inline2103 in tex2html_wrap_inline2105 that simultaneously fail. If each switch fails independently with probability tex2html_wrap_inline2107 , then the expected number of failures is tex2html_wrap_inline2109 , and the probability that any particular pair of switches u and tex2html_wrap_inline2113 both fail is tex2html_wrap_inline2115 . Since there are n pairs of switches, the probability that any pair fails is at most tex2html_wrap_inline2119 . This technique is easily generalized. By making c copies of each switch and tex2html_wrap_inline2123 copies of each edge, any n-switch network can be made to tolerate a failure rate of tex2html_wrap_inline2127 with probability tex2html_wrap_inline2129 .

By a similar argument, even if each switch in tex2html_wrap_inline2131 fails with some fixed constant probability then, with high probability, tex2html_wrap_inline2133 can simulate a constant fraction of the switches in G. In general, however, there is no guarantee that these switches in G will be connected in a useful fashion. As the following theorem shows, however, even if the switches fail with some constant probability, an N-input multibutterfly will have some set of tex2html_wrap_inline2141 inputs and tex2html_wrap_inline2143 outputs between which any permutation can be routed in tex2html_wrap_inline2145 steps, with high probability. Furthermore, an N-input multi-Benes network can tolerate constant failure probability and still route tex2html_wrap_inline2149 permutations of tex2html_wrap_inline2151 packets in tex2html_wrap_inline2153 time. The only other networks that are known to tolerate constant failure probability are the N-switch hypercube, which can route any tex2html_wrap_inline2157 permutations of N packets in tex2html_wrap_inline2161 time, with high probability, but has degree tex2html_wrap_inline2163 [10], and the N-switch mesh, which can route tex2html_wrap_inline2167 permutations of tex2html_wrap_inline2169 packets in tex2html_wrap_inline2171 time [12].

The strategy for tolerating random faults is the same as the strategy of Section 3.1 for tolerating worst-case faults. We first examine each splitter to determine if more than an tex2html_wrap_inline2173 fraction of the input switches in that splitter are faulty, where tex2html_wrap_inline2175 and tex2html_wrap_inline2177 . If so, then we erase the splitter and all of its descendant switches and outputs. Then we propagate faults back from level tex2html_wrap_inline2179 to level 0. A switch is declared to be faulty if at least half of its upper input edges or output edges lead to faulty switches that have not been erased. The following theorem shows that, with high probability, this strategy leaves a constant fraction of the network intact.

  theorem632

Proof: The hard part of the proof is showing that we don't erase too many outputs.

We begin by deriving an upper bound on the probability that more than tex2html_wrap_inline2197 input switches fail in an M-input splitter. Let tex2html_wrap_inline2201 denote the number of input switches that fail in an M-input splitter. Then tex2html_wrap_inline2205 has a binomial distribution, and

displaymath2207

Setting tex2html_wrap_inline2209 , we have tex2html_wrap_inline2211 . Using the inequality tex2html_wrap_inline2213 and letting tex2html_wrap_inline2215 , we have tex2html_wrap_inline2217 , where tex2html_wrap_inline2219 for tex2html_wrap_inline2221 .

We can analyze the number of erased splitters on each level in a similar fashion. Consider a level of the network that contains tex2html_wrap_inline2223 M-input splitters. Using the fact that each splitter on this level is erased independently with probability at most tex2html_wrap_inline2227 , we can show that with probability at least tex2html_wrap_inline2229 , the number of erased splitters is at most tex2html_wrap_inline2231 for any constant tex2html_wrap_inline2233 . For small tex2html_wrap_inline2235 and all M, c can be made arbitrarily close to 0. Hence, with probability at least tex2html_wrap_inline2243 , at most tex2html_wrap_inline2245 outputs are erased due to faults occuring in splitters with M inputs. Summing over tex2html_wrap_inline2249 , we find that with probability at last tex2html_wrap_inline2251 , at most tex2html_wrap_inline2253 outputs are erased overall. Hence, at least tex2html_wrap_inline2255 outputs remain, where tex2html_wrap_inline2257 can be made close to 1 by setting tex2html_wrap_inline2261 to be close to zero.

Once all of the blocks containing more than an tex2html_wrap_inline2263 fraction of faulty switches are erased, we can apply Lemmas 3.2 and 3.3 to show that there aren't too many propagated faults on any level. In order to apply Lemma 3.3, we must bound the number of switches that fail on any level. To do this, we bound the binomial distribution to show that at most tex2html_wrap_inline2265 switches fail on a level with probability tex2html_wrap_inline2267 , where tex2html_wrap_inline2269 . By Lemma 3.3, if there are at most tex2html_wrap_inline2271 switch failures on any level, then there are at most tex2html_wrap_inline2273 propagated faults, and tex2html_wrap_inline2275 total faults on any level. Thus, with probability at least tex2html_wrap_inline2277 , some set of tex2html_wrap_inline2279 inputs and tex2html_wrap_inline2281 outputs remain, where tex2html_wrap_inline2283 . (Notice that tex2html_wrap_inline2285 since c and tex2html_wrap_inline2289 can be made arbitrarily small.) The time to route between these inputs and outputs is given by Theorem 3.4.


to .667emto .667em



next up previous
Next: 3.3 On-line reconfiguration Up: 3 Routing around faults Previous: 3.1 Worst case faults



Bruce Maggs
Mon Jul 22 18:45:42 EDT 1996