Bayesian networks [Pearl1988] are increasingly popular tools for modeling uncertainty in intelligent systems. With practical models reaching the size of several hundreds of variables (e.g., [Pradhan et al.1994,Conati et al.1997]), it becomes increasingly important to address the problem of feasibility of probabilistic inference. Even though several ingenious exact algorithms have been proposed, in very large models they all stumble on the theoretically demonstrated NP-hardness of inference [Cooper1990]. The significance of this result can be observed in practice -- exact algorithms applied to large, densely connected practical networks require either a prohibitive amount of memory or a prohibitive amount of computation and are unable to complete. While approximating inference to any desired precision has been shown to be NP-hard as well [Dagum and Luby1993], it is for very complex networks the only alternative that will produce any result at all. Furthermore, while obtaining the result is crucial in all applications, precision guarantees may not be critical for some types of problems and can be traded off against the speed of computation.
A prominent subclass of approximate algorithms is the family of stochastic sampling algorithms (also called stochastic simulation or Monte Carlo algorithms). The precision obtained by stochastic sampling generally increases with the number of samples generated and is fairly unaffected by the network size. Execution time is fairly independent of the topology of the network and is linear in the number of samples. Computation can be interrupted at any time, yielding an anytime property of the algorithms, important in time-critical applications.
While stochastic sampling performs very well in predictive inference, diagnostic reasoning, i.e., reasoning from observed evidence nodes to their ancestors in the network often exhibits poor convergence. When the number of observations increases, especially if these observations are unlikely a-priori, stochastic sampling often fails to converge to reasonable estimates of the posterior probabilities. Although this problem has been known since the very first sampling algorithm was proposed by Henrion [1988], little has been done to address it effectively. Furthermore, various sampling algorithms proposed were tested on simple and small networks, or networks with special topology, without the presence of extremely unlikely evidence and the practical significance of this problem has been underestimated. Given a typical number of samples used in real-time that are feasible on today's hardware, say 106 samples, the behavior of a stochastic sampling algorithm will be drastically different for different size networks. While in a network consisting of 10 nodes and a few observations, it may be possible to converge to exact probabilities, in very large networks only a negligibly small fraction of the total sample space will be probed. One of the practical Bayesian network models that we used in our tests, a subset of the CPCS network [Pradhan et al.1994], consists of 179 nodes. Its total sample space is larger than 1061. With 106 samples, we can sample only 10-55 fraction of the sample space.
We believe that it is crucial (1) to study the feasibility and convergence properties of sampling algorithms on very large practical networks, and (2) to develop sampling algorithms that will show good convergence under extreme, yet practical conditions, such as evidential reasoning given extremely unlikely evidence. After all, small networks can be updated using any of the existing exact algorithms -- it is precisely the very large networks where stochastic sampling can be most useful. As to the likelihood of evidence, we know that stochastic sampling will generally perform well when it is high [Henrion1988]. So, it is important to look at those cases in which evidence is very unlikely. In this paper, we test two existing state of the art stochastic sampling algorithms for Bayesian networks, likelihood weighting [Fung and Chang1989,Shachter and Peot1989] and self-importance sampling [Shachter and Peot1989], on a subset of the CPCS network with extremely unlikely evidence. We show that they both exhibit similarly poor convergence rates. We propose a new sampling algorithm, that we call the adaptive importance sampling for Bayesian networks (AIS-BN), which is suitable for evidential reasoning in large multiply-connected Bayesian networks. The AIS-BN algorithm is based on importance sampling, which is a widely applied method for variance reduction in simulation that has also been applied in Bayesian networks (e.g., [Shachter and Peot1989]). We demonstrate empirically on three large practical Bayesian network models that the AIS-BN algorithm consistently outperforms the other two algorithms. In the majority of the test cases, it achieved over two orders of magnitude improvement in convergence. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms never achieved the precision reached even by the first few iterations of the AIS-BN algorithm. The main sources of improvement are: (1) two heuristics for the initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for updating the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We study the value of the two heuristics used in the AIS-BN algorithm: (1) initialization of the probability distributions of parents of evidence nodes to uniform distribution and (2) adjusting very small probabilities in the conditional probability tables, and show that they both play an important role in the AIS-BN algorithm but only a moderate role in the existing algorithms.
The remainder of this paper is structured as follows. Section 2 first gives a general introduction to importance sampling in the domain of finite-dimensional integrals, where it was originally proposed. We show how importance sampling can be used to compute probabilities in Bayesian networks and how it can draw additional benefits from the graphical structure of the network. Then we develop a generalized sampling scheme that will aid us in reviewing the previously proposed sampling algorithms and in describing the AIS-BN algorithm. Section 3 describes the AIS-BN algorithm. We propose two heuristics for initialization of the importance function and discuss their theoretical foundations. We describe a smooth learning method for the importance function and a dynamic weighting function for combining samples from different stages of the algorithm. Section 4 describes the empirical evaluation of the AIS-BN algorithm. Finally, Section 5 suggests several possible improvements to the AIS-BN algorithm, possible applications of our learning scheme, and directions for future work.