The largest network that has been tested using sampling algorithms is QMR-DT (Quick Medical Reference -- Decision Theoretic) [Shwe et al.1991,Shwe and Cooper1991], which contains 534 adult diseases and 4,040 findings, with 40,740 arcs depicting disease-to-finding dependencies. The QMR-DT network belongs to a class of special bipartite networks and its structure is often referred to as BN2O [Henrion1991], because of its two-layer composition: disease nodes in the top layer and finding nodes in the bottom layer. Shwe and colleagues used an algorithm combining self-importance and heuristic importance and tested its convergence properties on the QMR-DT network. But since the heuristic method iterative tabular Bayes (ITB) that makes use of a version of Bayes' rule is designed for the BN2O networks, it cannot be generalized to arbitrary networks. Although Shwe and colleagues concluded that Markov blanket scoring and self-importance sampling significantly improve the convergence rate in their model, we cannot extend this conclusion to general networks. The computation of Markov blanket scoring is more complex in a general multi-connected network than in a BN2O network. Also, the experiments conducted lacked a gold-standard posterior probability distribution that could serve to judge the convergence rate.
Pradhan and Dagum [1996] tested an efficient version of the LW algorithm -- bounded variance algorithm [Dagum and Luby1997] and the AA algorithm [Dagum et al.1995] on a 146 node, multiply connected medical diagnostic Bayesian network. One limitation in their tests is that the probability of evidence in the cases selected for testing was rather high. Although over 10% of the cases had the probability of evidence on the order of 10-8 or smaller, a simple calculation based on the reported mean = 34.5 number of evidence nodes, shows that the average probability of an observed state of an evidence node conditional on its direct predecessors was on the order of (10-8)1/34.5 0.59. Given that their algorithm is essentially based on the LW algorithm, based on our tests we suspect that the performance will deteriorate on cases where the evidence is very unlikely. Both algorithms focus on the marginal probability of one hypothesis node. If there are many queried nodes, the efficiency may deteriorate.
We have tested the algorithms discussed in Section 2.3 on several large networks. Our experimental results show that in cases with very unlikely evidence, none of these algorithms converges to reasonable estimates of the posterior probabilities within a reasonable amount of time. The convergence becomes worse as the number of evidence nodes increases. Thus, when using these algorithms in very large networks, we simply cannot trust the results. We will present results of tests of the LW and SIS algorithms in more detail in Section 4.