The dimensionality of the problem of Bayesian network inference is
equal to the number of variables in a network, which in the
networks considered in this paper can be very high. As a result,
the learning space of the optimal importance function is very
large. The choice of the initial importance function
$\Pr^0(\mathbf{X} \setminus \mathbf{E})$ is an important factor
affecting the learning: an initial importance function that is
close to the optimal importance function can greatly speed up
convergence. In this section, we present two heuristics
that help to achieve this goal.
Due to their explicit encoding of the structure of a decomposable
joint probability distribution, Bayesian networks offer
computational advantages compared to finite-dimensional integrals.
A possible first approximation of the optimal importance function
is the prior probability distribution over the network variables,
$\Pr(\mathbf{X})$. We propose an improvement on this
initialization. We know that the effect of evidence nodes on a
node will be attenuated as the path length of that node to
evidence nodes is increased [Henrion1989] and the most
affected nodes are the direct ancestors of the evidence nodes.
Initializing the ICPT tables of the parents of the evidence
nodes to uniform distributions in our experience improves the
convergence rate. Furthermore, the CPT tables of the parents of
an evidence node $E$ may not be favorable to the observed state
$e$ when the marginal probability of $E = e$, without conditioning
on any evidence, is small, for example when
$\Pr(E = e) < 1/(2 \cdot n_E)$, where $n_E$
is the number of outcomes of node $E$. Based on this observation,
in our experiments we change the CPT tables of the parents of an
evidence node $E$ to uniform distributions only when
$\Pr(E = e) < 1/(2 \cdot n_E)$; otherwise we leave them unchanged.
This kind of initialization requires knowledge of $\Pr(E = e)$,
the marginal probability without evidence. Probabilistic logic
sampling [Henrion1988] enhanced by Latin hypercube sampling
[Cheng and Druzdzel2000b] or quasi-Monte Carlo methods [Cheng and Druzdzel2000a]
will produce a very good estimate of $\Pr(E = e)$. This is a
one-time effort that can be made at the model building stage and
is worth pursuing to any desired precision.
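The initialization rule above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not
the implementation used in the experiments: the two-node network, its
dictionary layout, and all function names are hypothetical, and plain
forward (logic) sampling stands in for the Latin hypercube and
quasi-Monte Carlo enhancements cited in the text.

```python
import random

# Hypothetical toy network A -> E with binary nodes. Each node maps to
# (parents, cpt); cpt maps a tuple of parent states to a distribution over
# the node's own states. Nodes are listed in topological order.
network = {
    "A": ([], {(): [0.99, 0.01]}),
    "E": (["A"], {(0,): [0.999, 0.001], (1,): [0.2, 0.8]}),
}


def forward_sample(net, rng):
    """Draw one joint sample by sampling each node given its parents."""
    sample = {}
    for node, (parents, cpt) in net.items():
        dist = cpt[tuple(sample[p] for p in parents)]
        sample[node] = rng.choices(range(len(dist)), weights=dist)[0]
    return sample


def estimate_marginal(net, node, state, n_samples, rng):
    """Estimate Pr(node = state) without evidence by plain forward sampling."""
    hits = sum(forward_sample(net, rng)[node] == state for _ in range(n_samples))
    return hits / n_samples


def should_flatten_parents(pr_e, n_outcomes):
    """Rule from the text: flatten only if Pr(E = e) < 1 / (2 * n_E)."""
    return pr_e < 1.0 / (2 * n_outcomes)


def uniform_icpt_for_parents(net, evidence_node, evidence_state,
                             n_samples=20000, seed=0):
    """Copy the CPTs; make the evidence node's parents uniform if the rule fires."""
    rng = random.Random(seed)
    pr_e = estimate_marginal(net, evidence_node, evidence_state, n_samples, rng)
    n_e = len(next(iter(net[evidence_node][1].values())))
    # Start the importance tables from a copy of the original CPTs.
    icpt = {n: {k: list(v) for k, v in cpt.items()} for n, (_, cpt) in net.items()}
    if should_flatten_parents(pr_e, n_e):
        for parent in net[evidence_node][0]:
            for key in list(icpt[parent]):
                k = len(icpt[parent][key])
                icpt[parent][key] = [1.0 / k] * k
    return pr_e, icpt


# Observing the rare state E = 1 triggers the rule; observing E = 0 does not.
pr_rare, icpt_rare = uniform_icpt_for_parents(network, "E", 1)
pr_common, icpt_common = uniform_icpt_for_parents(network, "E", 0)
```

In this toy network the exact marginal of the rare state is
0.99 * 0.001 + 0.01 * 0.8 = 0.00899, well below 1/(2 * 2) = 0.25, so
the table of the parent A is replaced by a uniform distribution, while
for the common state the tables are left unchanged.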
Another serious problem related to sampling is that of extremely
small probabilities. Suppose there exists a root node with a state
$s$ that has the prior probability
$\Pr(s) = 0.0001$. Let the
posterior probability of this state given evidence be
$\Pr(s \mid \mathbf{E}) = 0.8$. A simple calculation shows that if we
update the importance function every 1,000 samples, we can
expect to hit $s$ only once every 10 updates. Thus $s$'s
convergence rate will be very slow. We can overcome this problem
by setting a threshold $\theta$ and replacing every probability
$p < \theta$ in the network by $\theta$. At the same time, we
subtract $(\theta - p)$ from the largest probability in the same
conditional probability distribution. For example, the value of
$\theta = 10/l$, where $l$ is the updating interval, will allow us
to sample 10 times more often in the first stage of the
algorithm. If this state turns out to be more likely (having a
large weight), we can increase its probability even more in order
to converge to the correct answer faster. Considering that we
should avoid
$f(\mathbf{X}) \ll |g(\mathbf{X}) - I \cdot f(\mathbf{X})|$
in an unimportant region, as discussed in
Section 2.1, we need to make this threshold larger.
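The threshold adjustment described above can be sketched as follows.
This is an illustrative sketch only: the function name
`apply_threshold` and the list representation of a distribution are
assumptions, and the code further assumes that the largest entry is
big enough to absorb the added mass, which holds for the small
thresholds discussed here.

```python
def apply_threshold(dist, theta):
    """Raise every probability below theta to theta.

    The added mass (theta - p) is subtracted from the largest entry of the
    same distribution, so the result still sums to one. Assumes the largest
    entry stays the largest after the subtraction.
    """
    adjusted = list(dist)
    largest = max(range(len(adjusted)), key=adjusted.__getitem__)
    for i, p in enumerate(dist):
        if i != largest and p < theta:
            adjusted[largest] -= theta - p
            adjusted[i] = theta
    return adjusted


# The example from the text: updating interval l = 1000 and Pr(s) = 0.0001.
l = 1000
theta = 10 / l
prior = [0.0001, 0.9999]
adjusted = apply_threshold(prior, theta)

# Expected number of times s is sampled within one updating interval.
expected_hits_before = l * prior[0]
expected_hits_after = l * adjusted[0]
```

With these numbers the expected number of hits of s per updating
interval is about 0.1 before the adjustment, which matches the one hit
per 10 updates mentioned above, and about 10 after it.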
We have found that the convergence rate is quite sensitive to this
threshold. Based on our empirical tests, we suggest using
$\theta = 0.04$ in networks whose maximum number of outcomes per node
does not exceed five. A smaller threshold might lead to fast
convergence in some cases but slow convergence in others. If one
threshold does not work, changing it for the specific network will
usually improve the convergence rate.