There is a fundamental trade-off in the AIS-BN algorithm between the time spent on learning the importance function and the time spent on sampling. Our current approach, which we believe to be reasonable, is to stop learning at the point when the importance function is good enough. In our experiments we stopped learning after 10 iterations.
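The resulting division of computation can be pictured with a small sketch that splits a fixed sample budget between a learning phase of a fixed number of updating steps (10 in our experiments) and the final estimation phase; the helper below and its per-step sample count are illustrative assumptions, not values prescribed by the algorithm.

```python
def allocate_samples(total_samples, learning_steps=10, samples_per_step=1000):
    """Split a fixed sample budget between the learning and estimation phases.

    Learning stops after a fixed number of updating steps (10 in the
    experiments reported here); the per-step sample count is an illustrative
    assumption, not a value taken from the algorithm itself.
    """
    learning_samples = learning_steps * samples_per_step
    sampling_samples = max(0, total_samples - learning_samples)
    return learning_samples, sampling_samples
```

Under these assumptions, allocate_samples(100000) would reserve 10,000 samples for learning the importance function and leave 90,000 for computing the final estimate.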
There are several ways of improving the initialization of the conditional probability tables at the outset of the AIS-BN algorithm. In the current version of the algorithm, we initialize the ICPT table of every parent N of an evidence node E (N ∈ Pa(E)) to the uniform distribution when Pr(E = e) < 1/(2 n_E). This can be improved further. We can extend the initialization to those nodes that are severely affected by the evidence. They can be identified by examining the network structure and local CPTs.
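As a rough illustration of the current heuristic, the sketch below sets the ICPT of every parent of an unlikely evidence node to the uniform distribution. The dictionary-based network representation, the precomputed prior argument, and the reading of n_E as the number of outcomes of the evidence node are assumptions made for the example, not data structures or definitions taken from the algorithm's description.

```python
def init_icpts_for_unlikely_evidence(net, evidence, prior, icpt):
    """Apply the uniform-initialization heuristic to parents of evidence nodes.

    net      -- dict: node -> {"parents": [...], "states": [...]}
    evidence -- dict: evidence node -> observed state
    prior    -- dict: node -> {state: Pr(node = state)}  (assumed precomputed)
    icpt     -- dict: node -> {parent configuration -> list of probabilities}
    """
    for e_node, e_state in evidence.items():
        # n_E is interpreted here as the number of outcomes of the evidence node.
        n_e = len(net[e_node]["states"])
        # Heuristic trigger: the observed state is unlikely a priori.
        if prior[e_node][e_state] < 1.0 / (2.0 * n_e):
            for parent in net[e_node]["parents"]:
                k = len(net[parent]["states"])
                for config in icpt[parent]:
                    icpt[parent][config] = [1.0 / k] * k  # uniform distribution
    return icpt
```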
We can view the learning process of the AIS-BN algorithm as a network rebuilding process. The algorithm constructs a new network whose structure is the same as that of the original network (except that we delete the evidence nodes and the corresponding arcs). The constructed network models the joint probability distribution Pr^k(X\E | E = e) of Equation 8, which approaches the optimal importance function Pr(X\E | E = e). We use the learned Pr^k(X\E | E = e) to approximate this optimal distribution. If Pr^k(X\E | E = e) approximates Pr(X\E | E = e) accurately enough, we can use this new network to solve other approximate tasks, such as computing the maximum a posteriori (MAP) assignment [Pearl1988], finding the k most likely scenarios [Seroussi and Golmard1994], etc. A large advantage of this approach is that we can solve each of these problems as if the network had no evidence nodes.
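The rebuilding step itself is straightforward to sketch: copy the structure, drop the evidence nodes together with their incident arcs, and attach the learned ICPTs to the remaining nodes. The representation below is an assumption made for illustration only.

```python
def build_evidence_free_network(net, evidence, learned_icpt):
    """Construct the rebuilt network whose CPTs are the learned ICPTs.

    The new network shares the structure of the original one, except that
    evidence nodes and the arcs touching them are removed.
    """
    new_net = {}
    for node, spec in net.items():
        if node in evidence:
            continue                                   # drop evidence nodes
        new_net[node] = {
            "parents": [p for p in spec["parents"] if p not in evidence],
            "states": list(spec["states"]),
            "cpt": learned_icpt[node],                 # learned Pr^k(node | parents, e)
        }
    return new_net
```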
We know that Markov blanket scoring can improve convergence rates in some sampling algorithms [Shwe and Cooper1991]. It may also be applied to the AIS-BN algorithm to improve its convergence rate. According to Property 4 (Section 2.1), any technique that reduces the variance of the importance weights will reduce the variance of the estimator and correspondingly improve the sampling performance. Since the variance of stratified sampling [Rubinstein1981] is never much worse than that of random sampling, and can be much better, it can improve the convergence rate. We expect that other variance reduction methods known in statistics, such as (i) the use of expected values of random variables, (ii) antithetic variates and other correlation methods, (iii) stratification schemes such as Latin hypercube sampling, and (iv) systematic sampling, will also improve the sampling performance.
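As a concrete instance of one such technique, the sketch below stratifies the samples over the states of a single root variable and recombines the per-stratum means. The callables it takes (sample_given_root, weight) are hypothetical stand-ins for pieces of a generic importance sampler, not functions defined in this paper.

```python
def stratified_importance_estimate(states, probs, sample_given_root, weight, n):
    """Stratified importance sampling over the states of one root variable.

    states            -- outcomes of the chosen root variable
    probs             -- their probabilities under the importance function
    sample_given_root -- draws the remaining variables given a root state
    weight            -- importance weight of a complete sample
    n                 -- total number of samples to spend
    """
    estimate = 0.0
    for state, p in zip(states, probs):
        n_s = max(1, round(n * p))              # proportional allocation
        stratum_total = 0.0
        for _ in range(n_s):
            sample = sample_given_root(state)
            stratum_total += weight(sample)
        estimate += p * (stratum_total / n_s)   # weight each stratum mean by p
    return estimate
```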
The current learning algorithm uses a simple approach. Some heuristic learning methods, such as adjusting the learning rate according to changes in the error [Jacobs1988], should also be applicable to our algorithm. There are several tunable parameters in the AIS-BN algorithm. Finding the optimal values of these parameters for any given network is another interesting research topic.
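One such heuristic, in the spirit of [Jacobs1988], raises the learning rate while the estimation error keeps falling and cuts it back when the error grows. The multiplicative constants below are arbitrary illustrative choices, not values used in our experiments.

```python
def adjust_learning_rate(rate, prev_error, curr_error,
                         grow=1.1, shrink=0.5, rate_min=1e-3, rate_max=1.0):
    """Adapt the learning rate according to the change in the error.

    Increase the rate cautiously when the error decreased, and cut it back
    more aggressively when the error increased. Constants are illustrative.
    """
    rate = rate * grow if curr_error < prev_error else rate * shrink
    return min(rate_max, max(rate_min, rate))
```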
It is worth observing that the plots presented in Figure 8 are fairly flat. In other words, in our tests the convergence of the sampling algorithms did not depend too strongly on the probability of evidence. This seems to contradict the common belief that forward sampling schemes suffer from unlikely evidence. The plot for AIS-BN, in particular, is fairly flat, while the convergence of the SIS and LW algorithms seems to deteriorate slightly as the evidence becomes less likely. It is possible that all three algorithms will perform much worse when the probability of evidence drops below some threshold value, which our tests failed to approach. Until this relationship has been studied carefully, we conjecture that the probability of evidence is not a good measure of the difficulty of approximate inference.
Given that the problem of approximating probabilistic inference is NP-hard, there exist networks that will be challenging for any algorithm, and we have no doubt that even the AIS-BN algorithm will perform poorly on them. To date, we have not found such networks. There is, however, one characteristic that may make a network challenging for the AIS-BN algorithm: in general, as the number of parameters that need to be learned by the AIS-BN algorithm increases, its performance will deteriorate. Nodes with many parents, for example, are challenging to the AIS-BN learning algorithm, as it has to update the ICPT entries for all combinations of the parents' states. It is possible that conditional probability distributions with causal independence properties, such as Noisy-OR distributions [Pearl1988,Henrion1989,Diez1993,Srinivas1993,Heckerman and Breese1994], which are common in very large practical networks, can be treated differently and lead to considerable savings in the learning time.
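For example, a noisy-OR node with n binary parents is fully described by n inhibition parameters (plus an optional leak term), whereas a full table over the same parents has 2^n rows. A minimal sketch of expanding such a parameterization, under the assumption of binary variables:

```python
from itertools import product

def noisy_or_cpt(inhibitions, leak=0.0):
    """Expand a noisy-OR parameterization into a full CPT over binary parents.

    inhibitions[i] -- probability that parent i, when present, fails to cause
                      the effect
    leak           -- probability of the effect when no parent is present
    Only len(inhibitions) + 1 numbers describe all 2**n parent configurations.
    """
    cpt = {}
    for config in product([0, 1], repeat=len(inhibitions)):
        q = 1.0 - leak
        for present, inh in zip(config, inhibitions):
            if present:
                q *= inh                       # each present cause may be inhibited
        cpt[config] = 1.0 - q                  # Pr(effect = true | config)
    return cpt
```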
One direction of testing approximate algorithms, suggested to us by a reviewer, is to use very large networks for which an exact solution cannot be computed at all. In this case, one can try to infer from the difference in variance at various stages of the algorithm whether or not it is converging. This is a very interesting idea that is worth exploring, especially when combined with theoretical work on stopping criteria along the lines of the work of Dagum and Luby [1997].
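A simple, admittedly ad hoc, way to operationalize this idea is to track the variance of the importance weights over successive sampling stages and report convergence when it stabilizes. The batch structure and tolerance below are assumptions; a principled treatment would follow stopping rules such as those of Dagum and Luby [1997].

```python
import statistics

def weights_look_converged(weight_batches, rel_tol=0.1):
    """Crude convergence check based on the variance of importance weights.

    weight_batches -- list of lists of importance weights, one list per stage
    rel_tol        -- relative change in variance below which we declare
                      convergence (an arbitrary illustrative threshold)
    """
    if len(weight_batches) < 2:
        return False
    v_prev = statistics.pvariance(weight_batches[-2])
    v_curr = statistics.pvariance(weight_batches[-1])
    if v_prev == 0.0:
        return v_curr == 0.0
    return abs(v_curr - v_prev) / v_prev < rel_tol
```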