The main difference among the various stochastic sampling algorithms lies in how they process Steps 2, 7, and 8 of the generic importance sampling algorithm of Figure 1.
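Since the algorithms below are all described as instances of this generic scheme, a minimal sketch of the loop may help. It assumes a toy two-node network A -> B with evidence B = 1; the step numbers in the comments refer to Figure 1, and all function and variable names are ours, not the authors' notation.

```python
import random

# CPTs for a hypothetical two-node network A -> B: Pr(A = 1) and Pr(B = 1 | A).
PR_A1 = 0.3
PR_B1_GIVEN_A = {0: 0.2, 1: 0.9}

def joint(a, b):
    """Pr(A = a, B = b) by the chain rule."""
    pa = PR_A1 if a == 1 else 1.0 - PR_A1
    pb = PR_B1_GIVEN_A[a] if b == 1 else 1.0 - PR_B1_GIVEN_A[a]
    return pa * pb

def estimate_pr_evidence(n_samples=100_000, evidence_b=1):
    # Step 2: initialize the importance function; this sketch uses the prior
    # over A and leaves Step 7 (updating the function) as a no-op.
    q_a1 = PR_A1
    total_score = 0.0
    for _ in range(n_samples):
        # Steps 3-5: draw a sample of the unobserved node A from q.
        a = 1 if random.random() < q_a1 else 0
        # Step 8: score the sample by Pr(sample, e) / q(sample).
        q = q_a1 if a == 1 else 1.0 - q_a1
        total_score += joint(a, evidence_b) / q
    # Step 10: the average score estimates Pr(B = evidence_b).
    return total_score / n_samples

print(estimate_pr_evidence())  # close to 0.7 * 0.2 + 0.3 * 0.9 = 0.41
```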
Probabilistic logic sampling [Henrion 1988] is the simplest and the first proposed sampling algorithm for Bayesian networks. The importance function is initialized in Step 2 to the prior Pr(X) and never updated (Step 7 is null). Without evidence, Pr(X) is the optimal importance function for the evidence set, which is empty anyway. It escapes most authors that Pr(X) may not be the optimal importance function for Pr(A = a), A ∈ X, when A is not a root node. A mismatch between the optimal importance function and the one actually used may result in a large variance. The sampling process with evidence is the same as without evidence, except that in Step 10 we do not count the scores of samples that are inconsistent with the observed evidence, which amounts to discarding them. When the evidence is very unlikely, there is a large difference between Pr(X) and the optimal importance function: effectively, most samples are discarded and the performance of logic sampling deteriorates badly.
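To make the rejection behavior concrete, here is a minimal sketch of logic sampling on the same hypothetical A -> B network: every node is sampled forward from the prior, and samples that contradict the evidence B = 1 are simply not counted.

```python
import random

PR_A1 = 0.3                       # Pr(A = 1) in the toy network A -> B
PR_B1_GIVEN_A = {0: 0.2, 1: 0.9}  # Pr(B = 1 | A)

def logic_sampling(n_samples=100_000, evidence_b=1):
    kept = a1_count = 0
    for _ in range(n_samples):
        # Sample every node forward from the prior Pr(X), evidence included.
        a = 1 if random.random() < PR_A1 else 0
        b = 1 if random.random() < PR_B1_GIVEN_A[a] else 0
        # Step 10: do not count samples inconsistent with the evidence.
        if b != evidence_b:
            continue
        kept += 1
        a1_count += a
    # The fraction of kept samples estimates Pr(A = 1 | B = evidence_b);
    # kept / n_samples approximates Pr(B = evidence_b), the acceptance rate.
    return a1_count / kept, kept / n_samples

posterior, acceptance = logic_sampling()
print(posterior, acceptance)  # roughly 0.66 and 0.41; the acceptance rate
                              # collapses when the evidence is unlikely
```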
Likelihood weighting (LW) [Fung and Chang 1989, Shachter and Peot 1989] enhances logic sampling in that it never discards samples. In likelihood weighting, the importance function in Step 2 is the prior distribution restricted to the unobserved nodes, Pr^0(X\E) = Pr(X\E): the evidence nodes are clamped to their observed values, each remaining node is sampled from its conditional probability table given its parents, and every sample is scored by the likelihood of the evidence given the sampled values. This importance function is never updated (Step 7 is null).
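A minimal likelihood weighting sketch on the same hypothetical network, showing that the evidence node is clamped rather than sampled, and that each sample survives with a weight equal to the likelihood of the evidence:

```python
import random

PR_A1 = 0.3                       # Pr(A = 1) in the toy network A -> B
PR_B1_GIVEN_A = {0: 0.2, 1: 0.9}  # Pr(B = 1 | A)

def likelihood_weighting(n_samples=100_000, evidence_b=1):
    weighted_a1 = total_weight = 0.0
    for _ in range(n_samples):
        # Sample only the non-evidence node A from its CPT (the prior).
        a = 1 if random.random() < PR_A1 else 0
        # B is clamped to the evidence; the sample's weight is the
        # likelihood Pr(B = evidence_b | A = a).
        pb1 = PR_B1_GIVEN_A[a]
        w = pb1 if evidence_b == 1 else 1.0 - pb1
        weighted_a1 += w * a
        total_weight += w
    # Normalized weights estimate Pr(A = 1 | B = evidence_b).
    return weighted_a1 / total_weight

print(likelihood_weighting())  # roughly 0.66, with no sample ever discarded
```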
Backward sampling [Fung and del Favero 1994] changes Step 1 of our generic algorithm and allows for generating samples from evidence nodes in the direction opposite to the topological order of nodes in the network. In Step 2, backward sampling uses the likelihood of some of the observed evidence and some instantiated nodes to calculate Pr^0(X\E). Although Fung and del Favero mentioned the possibility of dynamic node ordering, they did not propose any scheme for updating the importance function in Step 7. Backward sampling suffers from problems similar to those of likelihood weighting, i.e., a possible mismatch between its importance function and the optimal importance function can lead to poor convergence.
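The following is a deliberately simplified sketch of the backward step for a single evidence node in the toy A -> B network; it illustrates only the reversal of sampling direction (drawing the parent A from a distribution proportional to Pr(B = e | A) Pr(A)), not Fung and del Favero's full node-ordering scheme.

```python
import random

PR_A1 = 0.3                       # Pr(A = 1) in the toy network A -> B
PR_B1_GIVEN_A = {0: 0.2, 1: 0.9}  # Pr(B = 1 | A)

def backward_sample_parent(evidence_b=1):
    # Unnormalized backward weights Pr(B = e | A = a) * Pr(A = a).
    w = {}
    for a in (0, 1):
        pa = PR_A1 if a == 1 else 1.0 - PR_A1
        pb = PR_B1_GIVEN_A[a] if evidence_b == 1 else 1.0 - PR_B1_GIVEN_A[a]
        w[a] = pb * pa
    z = w[0] + w[1]
    # Draw A against the arc direction, from the normalized backward
    # distribution; the sample's importance weight is the normalizer z
    # (constant here only because the whole network is just A -> B).
    a = 1 if random.random() < w[1] / z else 0
    return a, z

print(backward_sample_parent())  # e.g. (1, 0.41)
```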
Importance sampling [Shachter and Peot 1989] is the same as our generic sampling algorithm. Shachter and Peot introduced two variants of importance sampling: self-importance (SIS) and heuristic importance. The importance function used in the first step of the self-importance algorithm is the prior, Pr^0(X\E) = Pr(X\E); it is then revised periodically in Step 7, using the scores accumulated so far, so that the sampling distribution gradually approaches the posterior.
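A minimal sketch of the self-importance idea on the same toy network: the sampling probability for A starts at the prior (Step 2) and is periodically pulled toward the posterior estimated from the scores collected so far (Step 7). The update schedule and the clipping to (0.01, 0.99) are our illustrative choices, not part of the published algorithm.

```python
import random

PR_A1 = 0.3                       # Pr(A = 1) in the toy network A -> B
PR_B1_GIVEN_A = {0: 0.2, 1: 0.9}  # Pr(B = 1 | A)

def self_importance(n_samples=100_000, evidence_b=1, update_every=1_000):
    q_a1 = PR_A1  # Step 2: Pr^0(X\E) is the prior
    weighted_a1 = total_weight = 0.0
    for i in range(1, n_samples + 1):
        a = 1 if random.random() < q_a1 else 0
        # Step 8: score = Pr(a, e) / q(a).
        pa = PR_A1 if a == 1 else 1.0 - PR_A1
        pb = PR_B1_GIVEN_A[a] if evidence_b == 1 else 1.0 - PR_B1_GIVEN_A[a]
        q = q_a1 if a == 1 else 1.0 - q_a1
        score = pa * pb / q
        weighted_a1 += score * a
        total_weight += score
        # Step 7: periodically move the importance function toward the
        # current posterior estimate, kept away from 0 and 1 for stability.
        if i % update_every == 0:
            q_a1 = min(0.99, max(0.01, weighted_a1 / total_weight))
    return weighted_a1 / total_weight  # estimates Pr(A = 1 | B = evidence_b)

print(self_importance())  # close to the exact posterior, roughly 0.66
```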
A separate group of stochastic sampling methods is formed by so-called Markov Chain Monte Carlo (MCMC) methods, which are divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte Carlo sampling [Geman and Geman 1984, Gilks et al. 1996, MacKay 1998]. Roughly speaking, these methods draw random samples from an unknown target distribution f(X) by biasing the search for this distribution towards higher probability regions. When applied to Bayesian networks [Pearl 1987, Chavez and Cooper 1990], this approach determines the sampling distribution of a variable from its previous sample given its Markov blanket [Pearl 1988]. This corresponds to updating Pr^k(X\E) when sampling every node. Pr^k(X\E) will converge to the optimal importance function for Pr(E = e) if Pr^0(X\E) satisfies some ergodic properties [York 1992]. Since the convergence to the limiting distribution is very slow and calculating updates of the sampling distribution is costly, these algorithms are not used in practice as often as the simple likelihood weighting scheme.
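A minimal Gibbs sampling sketch for a hypothetical v-structure A -> C <- B with evidence C = 1: each unobserved node is resampled from its distribution given its Markov blanket, which here reduces to a quantity proportional to Pr(node) * Pr(C = 1 | A, B). The burn-in length and CPT numbers are illustrative.

```python
import random

PR_A1, PR_B1 = 0.3, 0.5  # priors Pr(A = 1), Pr(B = 1)
# Pr(C = 1 | A, B) for the v-structure A -> C <- B.
PR_C1 = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.95}

def resample_given_blanket(prior1, other, is_a):
    # Pr(node = v | blanket) is proportional to Pr(node = v) * Pr(C = 1 | A, B).
    w = []
    for v in (0, 1):
        key = (v, other) if is_a else (other, v)
        pv = prior1 if v == 1 else 1.0 - prior1
        w.append(pv * PR_C1[key])
    return 1 if random.random() < w[1] / (w[0] + w[1]) else 0

def gibbs(n_steps=200_000, burn_in=10_000):
    a = b = 0  # arbitrary initial state; evidence C = 1 is fixed throughout
    a1_count = 0
    for t in range(n_steps):
        a = resample_given_blanket(PR_A1, b, is_a=True)
        b = resample_given_blanket(PR_B1, a, is_a=False)
        if t >= burn_in:
            a1_count += a
    return a1_count / (n_steps - burn_in)  # estimates Pr(A = 1 | C = 1)

print(gibbs())  # close to the exact posterior 0.2175 / 0.375 = 0.58
```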
There are also other simulation algorithms, such as the bounded-variance algorithm [Dagum and Luby 1997] and the AA algorithm [Dagum et al. 1995], which are essentially based on the LW algorithm and the Stopping-Rule Theorem [Dagum et al. 1995]. Cano et al. [1996] proposed another importance sampling algorithm that performed somewhat better than LW in cases with extreme probability distributions but, as the authors state, in general cases it ``produced similar results to the likelihood weighting algorithm.'' Hernandez et al. [1998] also applied importance sampling and reported a moderate improvement over likelihood weighting.