There is a fundamental trade-off in the AIS-BN algorithm between the time spent on learning the importance function and the time spent on sampling. Our current approach, which we believe to be reasonable, is to stop learning at the point when the importance function is good enough. In our experiments we stopped learning after 10 iterations.
There are several ways of improving the initialization of the conditional probability tables at the outset of the AIS-BN algorithm. In the current version of the algorithm, we initialize the ICPT table of every parent N of an evidence node E, N ∈ Pa(E), to the uniform distribution when Pr(E = e) < 1/(2 n_E), where n_E denotes the number of outcomes of E. This can be improved further. We can extend the initialization to those nodes that are severely affected by the evidence. They can be identified by examining the network structure and the local CPTs.
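A minimal sketch of this initialization rule is given below, assuming a toy dictionary-based representation of the network; the names icpt, parents, num_states, and prior, as well as the way Pr(E = e) is obtained, are illustrative rather than part of any published AIS-BN implementation.

```python
import numpy as np

def initialize_parent_icpts(icpt, parents, num_states, prior, evidence, factor=0.5):
    """Reset the ICPT of every parent of an unlikely evidence node to uniform.

    icpt[N]     -- array of shape (parent configurations, num_states[N])
    prior[E][e] -- an estimate of Pr(E = e); how it is computed is left abstract
    evidence    -- dict mapping each evidence node E to its observed state e
    """
    for E, e in evidence.items():
        n_E = num_states[E]
        if prior[E][e] < factor / n_E:          # Pr(E = e) < 1/(2 * n_E)
            for N in parents[E]:
                # every row becomes the uniform distribution over N's outcomes
                icpt[N] = np.full_like(icpt[N], 1.0 / num_states[N])
    return icpt
```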
We can view the learning process of the AIS-BN algorithm as a network rebuilding process. The algorithm constructs a new network whose structure is the same as that of the original network, except that the evidence nodes and their corresponding arcs are deleted. The constructed network models the joint probability distribution of Equation 8, which approaches the optimal importance function. We use the learned ICPTs to approximate this distribution. If they approximate Pr(X \ E | E = e) accurately enough, we can use this new network to solve other approximate tasks, such as computing the Maximum A-Posteriori assignment (MAP) [Pearl1988], finding the k most likely scenarios [Seroussi and Golmard1994], etc. A large advantage of this approach is that we can solve each of these problems as if the network had no evidence nodes.
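The rebuilding view can be made concrete with a small sketch; the dictionary-based representation and the names below are purely illustrative, not the data structures of our implementation.

```python
def rebuild_network(parents, learned_icpts, evidence):
    """Construct the rebuilt network: the original graph restricted to the
    non-evidence nodes, with the learned ICPTs playing the role of the
    conditional probability tables."""
    new_parents, new_cpts = {}, {}
    for node, node_parents in parents.items():
        if node in evidence:
            continue                                  # evidence nodes and their arcs are deleted
        new_parents[node] = [p for p in node_parents if p not in evidence]
        new_cpts[node] = learned_icpts[node]          # learned ICPT replaces the original CPT
    return new_parents, new_cpts                      # usable for MAP, k most likely scenarios, etc.
```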
We know that Markov blanket scoring can improve convergence rates in some sampling algorithms [Shwe and Cooper1991]. It may also be applied to the AIS-BN algorithm to improve its convergence rate. According to Property 4 (Section 2.1), any technique that reduces the variance of the samples will reduce the variance of the estimate of Pr(E = e) and correspondingly improve the sampling performance. Since the variance of stratified sampling [Rubinstein1981] is never much worse than that of random sampling, and can be much better, it can improve the convergence rate. We expect that other variance reduction methods from statistics, such as (i) the use of the expected value of a random variable, (ii) antithetic variates and other correlation-inducing methods (stratified sampling, Latin hypercube sampling, etc.), and (iii) systematic sampling, will also improve the sampling performance.
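As a self-contained illustration of why stratification helps, the toy example below (unrelated to any particular Bayesian network) compares the variance of plain Monte Carlo and stratified estimates of a one-dimensional integral:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda u: u ** 2                          # toy integrand; the true value is 1/3

def plain_estimate(n):
    return f(rng.random(n)).mean()            # n independent uniform draws

def stratified_estimate(n):
    u = (np.arange(n) + rng.random(n)) / n    # one draw per equal-width stratum of [0, 1)
    return f(u).mean()

plain = np.array([plain_estimate(100) for _ in range(200)])
strat = np.array([stratified_estimate(100) for _ in range(200)])
print(plain.var(), strat.var())               # the stratified variance is dramatically smaller
```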
The current learning algorithm uses a simple approach. Some heuristic learning methods, such as adjusting the learning rate according to changes in the error [Jacobs1988], should also be applicable to our algorithm. There are several tunable parameters in the AIS-BN algorithm. Finding the optimal values of these parameters for any given network is another interesting research topic.
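As an example of such a heuristic, the following sketch adapts the learning rate by comparing successive error values, loosely in the spirit of Jacobs (1988); the constants and the error measure are arbitrary assumptions rather than part of our algorithm.

```python
def adapt_learning_rate(eta, previous_error, current_error, grow=1.05, shrink=0.7):
    """Grow the learning rate while the monitored error keeps decreasing,
    and cut it back when the error increases (all constants are illustrative)."""
    if current_error < previous_error:
        return eta * grow
    return eta * shrink
```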
It is worth observing that the plots presented in Figure 8 are fairly flat. In other words, in our tests the convergence of the sampling algorithms did not depend too strongly on the probability of evidence. This seems to contradict the common belief that forward sampling schemes suffer from unlikely evidence. The AIS-BN algorithm, in particular, shows a fairly flat plot. The convergence of the SIS and LW algorithms seems to deteriorate slightly with unlikely evidence. It is possible that all three algorithms will perform much worse when the probability of evidence drops below some threshold value, which our tests failed to approach. Until this relationship has been studied carefully, we conjecture that the probability of evidence is not a good measure of the difficulty of approximate inference.
Given that the problem of approximating probabilistic inference is NP-hard, there exist networks that will be challenging for any algorithm, and we have no doubt that even the AIS-BN algorithm will perform poorly on them. To date, we have not found such networks. There is one characteristic of networks that may be challenging to the AIS-BN algorithm. In general, as the number of parameters that need to be learned by the AIS-BN algorithm increases, its performance will deteriorate. Nodes with many parents, for example, are challenging to the AIS-BN learning algorithm, as it has to update the ICPT entries for all combinations of the parent nodes' states. It is possible that conditional probability distributions with causal independence properties, such as Noisy-OR distributions [Pearl1988,Henrion1989,Diez1993,Srinivas1993,Heckerman and Breese1994], which are common in very large practical networks, can be treated differently and lead to considerable savings in the learning time.
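The potential savings can be seen from a simple parameter count. The sketch below assumes binary nodes and a leaky Noisy-OR parameterization; it only counts parameters and does not implement the learning itself.

```python
def full_cpt_parameters(num_parents):
    # one free probability per configuration of binary parents
    return 2 ** num_parents

def noisy_or_parameters(num_parents, leak=True):
    # one inhibition probability per parent, plus an optional leak term
    return num_parents + (1 if leak else 0)

for m in (5, 10, 20):
    print(m, full_cpt_parameters(m), noisy_or_parameters(m))
# 20 parents: 1,048,576 CPT entries versus 21 Noisy-OR parameters
```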
One direction for testing approximate algorithms, suggested to us by a reviewer, is to use very large networks for which the exact solution cannot be computed at all. In this case, one can try to infer from the difference in variance at various stages of the algorithm whether it is converging or not. This is a very interesting idea that is worth exploring, especially when combined with theoretical work on stopping criteria along the lines of the work of Dagum and Luby [1997].
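A crude version of this diagnostic could look as follows; it splits the per-sample scores of a run into successive stages and reports the spread of the stage-wise estimates. It is meant only as an informal heuristic, not as the kind of formal stopping rule studied by Dagum and Luby [1997].

```python
import numpy as np

def stagewise_spread(sample_scores, num_stages=10):
    """Split the scores produced so far into successive stages, estimate the
    quantity of interest within each stage, and return the stage-wise
    estimates together with their standard deviation; a shrinking spread
    over time is weak evidence that the sampler is converging."""
    stages = np.array_split(np.asarray(sample_scores, dtype=float), num_stages)
    estimates = np.array([stage.mean() for stage in stages])
    return estimates, estimates.std(ddof=1)
```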