Random faults are much easier to tolerate than worst-case faults. In
fact, any network can be made to tolerate a large number of randomly
placed faults simply by making multiple copies of its switches and
edges. For example, suppose that an $n$-switch network $G$ is
replaced by a network $G'$ in which for each switch $u$ in $G$ there
are a pair of switches $u_1$ and $u_2$ in $G'$, and for each edge
$(u,v)$ in $G$ there are four edges $(u_1,v_1)$, $(u_1,v_2)$,
$(u_2,v_1)$, and $(u_2,v_2)$. Then $G'$ can simulate $G$ provided that
there is no pair of switches $u_1$ and $u_2$ in $G'$ that simultaneously
fail. If each switch fails independently with probability $p = \epsilon/\sqrt{n}$,
then the expected number of failures is $2\epsilon\sqrt{n}$,
and the probability that any particular pair of switches $u_1$ and $u_2$
both fail is $\epsilon^2/n$. Since there are $n$ pairs of switches, the
probability that any pair fails is at most $\epsilon^2$. This
technique is easily generalized. By making $c$ copies of each switch
and $c^2$ copies of each edge, any $n$-switch network can be made to
tolerate a failure rate of $\epsilon/n^{1/c}$ with probability $1 - \epsilon^c$.
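Spelled out, the probability estimate above is a short union bound; writing $F_u$ for the event that both copies $u_1$ and $u_2$ of a switch $u$ fail (this notation is ours), we have
$$\Pr[F_u] = p^2 = \frac{\epsilon^2}{n}, \qquad \Pr\Bigl[\,\bigcup_{u \in G} F_u\Bigr] \;\le\; \sum_{u \in G} \Pr[F_u] \;=\; n\cdot\frac{\epsilon^2}{n} \;=\; \epsilon^2 .$$
The generalized bound is obtained the same way: all $c$ copies of a fixed switch fail with probability $p^c$, so with $p = \epsilon/n^{1/c}$ the union bound over the $n$ switches gives failure probability at most $np^c = \epsilon^c$.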
By a similar argument, even if each switch in $G'$ fails with some
fixed constant probability, then, with high probability, $G'$ can
simulate a constant fraction of the switches in $G$. In general,
however, there is no guarantee that these switches in $G$ will be
connected in a useful fashion. As the following theorem shows,
however, even if the switches fail with some constant probability, an
$N$-input multibutterfly will have some set of $\Theta(N)$ inputs and
$\Theta(N)$ outputs between which any permutation can be routed in
$O(\log N)$ steps, with high probability. Furthermore, an $N$-input
multi-Benes network can tolerate constant failure probability and
still route permutations of $\Theta(N)$ packets in $O(\log N)$
time. The only other networks that are known to tolerate constant
failure probability are the $N$-switch hypercube, which can route any
permutation of $N$ packets in $O(\log N)$ time, with high probability,
but has degree $\log N$ [10], and the $N$-switch mesh, which can route
permutations of $\Theta(N)$ packets in $O(\sqrt{N})$ time [12].
The strategy for tolerating random faults is the same as the strategy
of Section 3.1 for tolerating worst-case faults.
We first examine each splitter to determine if more than an
$\epsilon_0$ fraction of the input switches in that splitter are faulty,
where $p$ denotes the failure probability of each switch, $\epsilon_0 > ep$,
and $\epsilon_0$ is small enough that the lemmas of
Section 3.1 apply. If so, then we erase the splitter and
all of its descendant switches and outputs. Then we propagate faults
back from level $\log N$ to level 0. A switch is declared to be
faulty if at least half of its upper output edges or at least half of
its lower output edges lead
to faulty switches that have not been erased. The following theorem
shows that, with high probability, this strategy leaves a constant
fraction of the network intact.
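In symbols (the notation here is ours, not the paper's): for a non-erased switch $v$ on level $l < \log N$, let $\mathrm{up}(v)$ and $\mathrm{low}(v)$ denote the sets of switches reached by its upper and lower output edges, and let $f(S)$ denote the number of switches in a set $S$ that are faulty and have not been erased. The backward propagation step, applied for $l = \log N - 1$ down to 0, declares
$$\mathrm{faulty}(v) \iff \mathrm{failed}(v) \;\text{ or }\; f(\mathrm{up}(v)) \ge \tfrac{1}{2}\,|\mathrm{up}(v)| \;\text{ or }\; f(\mathrm{low}(v)) \ge \tfrac{1}{2}\,|\mathrm{low}(v)|,$$
where on level $\log N$ only the switches that actually failed are considered faulty.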
Proof: The hard part of the proof is showing that we don't erase too many outputs.
We begin by deriving an upper bound on the probability that more than
$\epsilon_0 M$ input switches fail in an $M$-input splitter. Let $S$
denote the number of input switches that fail in an $M$-input
splitter. Then $S$ has a binomial distribution, and
$$\Pr[S \ge k] \;\le\; \binom{M}{k} p^{k} .$$
Setting $k = \epsilon_0 M$, we have
$\Pr[S \ge \epsilon_0 M] \le \binom{M}{\epsilon_0 M} p^{\epsilon_0 M}$. Using the inequality
$\binom{M}{\epsilon_0 M} \le (e/\epsilon_0)^{\epsilon_0 M}$ and letting
$b = \epsilon_0 \log_2(\epsilon_0/(ep))$, we have
$\Pr[S \ge \epsilon_0 M] \le 2^{-bM}$, where $b > 0$
for $\epsilon_0 > ep$.
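For completeness, the algebra behind the last step is
$$\binom{M}{\epsilon_0 M} p^{\epsilon_0 M} \;\le\; \Bigl(\frac{e}{\epsilon_0}\Bigr)^{\epsilon_0 M} p^{\epsilon_0 M} \;=\; \Bigl(\frac{ep}{\epsilon_0}\Bigr)^{\epsilon_0 M} \;=\; 2^{-\epsilon_0 M \log_2(\epsilon_0/(ep))} \;=\; 2^{-bM},$$
which is exponentially small in $M$ precisely because $\epsilon_0 > ep$ makes the logarithm positive.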
We can analyze the number of erased splitters on each level in a
similar fashion. Consider a level of the network that contains $N/M$
$M$-input splitters. Using the fact that each splitter on this level
is erased independently with probability at most $2^{-bM}$, we
can show that with probability at least $1 - O(1/N)$,
the number of erased splitters is at most $cN/M^2$ for any
constant $c > eM2^{-bM}$. For small $p$ and all
$M$, $c$ can be made arbitrarily close to 0. Hence, with
probability at least $1 - O(1/N)$, at most $cN/M$
outputs are erased due to faults occurring in splitters with $M$
inputs. Summing over $M = 2, 4, \ldots, N$, we find that with
probability at least $1 - O(\log N/N)$, at most $cN$ outputs are
erased overall. Hence, at least $\gamma N$ outputs remain, where
$\gamma = 1 - c$ can be made close to 1 by setting $c$ (equivalently, $p$) to be close to
zero.
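The sum over splitter sizes is geometric: with $M$ ranging over powers of two,
$$\sum_{M = 2, 4, \ldots, N} \frac{cN}{M} \;\le\; cN \sum_{j \ge 1} 2^{-j} \;=\; cN,$$
so the total number of erased outputs is dominated by the contribution of the smallest splitters.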
Once all of the blocks containing more than an $\epsilon_0$ fraction
of faulty switches are erased, we can apply
Lemmas 3.2 and 3.3 to show that
there aren't too many propagated faults on any level. In order to
apply Lemma 3.3, we must bound the number of switches
that fail on any level. To do this, we bound the binomial
distribution to show that at most $2pN$ switches fail on a level with
probability at least $1 - 2^{-dN}$, where $d > 0$ is a constant depending on $p$. By
Lemma 3.3, if there are at most $2pN$ switch
failures on any level, then there are at most $O(pN)$
propagated faults, and $O(pN)$
total faults on any level. Thus, with probability at least
$1 - O(\log N/N)$, some set of $\alpha N$ inputs and $\alpha N$
outputs remain, where $\alpha$ is a positive constant. (Notice that
$\alpha$ can be made arbitrarily close to 1,
since $c$ and $p$ can be made arbitrarily small.) The
time to route between these inputs and outputs is given by
Theorem 3.4.
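One standard way to obtain the per-level failure bound used above is a Chernoff estimate: if $X$ denotes the number of failures among the $N$ switches on a level, then
$$\Pr[X \ge 2pN] \;\le\; \Bigl(\frac{e}{4}\Bigr)^{pN} \;=\; 2^{-dN}, \qquad d = p\log_2\frac{4}{e} > 0 .$$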