In the previous section we described how variational transformations are derived for individual findings in the QMR model; we now discuss how to utilize these transformations in the context of an overall inference algorithm.
Conceptually the overall approach is straightforward. Each transformation involves replacing an exact conditional probability of a finding with a lower bound and an upper bound:
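for the noisy-OR conditionals of the QMR model, such bounds can be obtained via conjugate duality; schematically, in a notation of our own,

P(f_i^+ = 1 \mid d) \;=\; 1 - \exp\Big(-\theta_{i0} - \sum_j \theta_{ij} d_j\Big) \;\le\; \exp\Big(\xi_i\big(\theta_{i0} + \sum_j \theta_{ij} d_j\big) - f^*(\xi_i)\Big), \qquad f^*(\xi_i) \;=\; -\xi_i \ln \xi_i + (\xi_i + 1)\ln(\xi_i + 1),

where \theta_{i0} and \theta_{ij} denote the noisy-OR leak and disease-finding parameters, \xi_i \ge 0 is the variational parameter of the i-th transformation, and f^* is the conjugate of f(x) = \ln(1 - e^{-x}); the lower bound has an analogous variational form and is omitted here.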
Given that such transformations can be viewed as delinking the ith finding node from the graph, we see that the transformations not only yield bounds, but also yield a simplified graphical structure. We can imagine introducing transformations sequentially until the graph is sparse enough that exact methods become feasible. At that point we stop introducing transformations and run an exact algorithm.
There is a problem with this approach, however. We need to decide at each step which node to transform, and this requires an assessment of the effect of transforming the node on overall accuracy. We might imagine calculating the change in a probability of interest before and after a given transformation, and choosing to transform the node that yields the least change in our target probability. Unfortunately, we are unable to calculate probabilities in the original untransformed graph, and thus we are unable to assess the effect of transforming any one node; we are unable to get the algorithm started.
Suppose instead that we work backwards. That is, we introduce transformations for all of the findings, reducing the graph to an entirely decoupled set of nodes. We optimize the variational parameters for this fully transformed graph (more on optimization of the variational parameters below). For this graph inference is trivial. Moreover, it is also easy to calculate the effect of reinstating a single exact conditional at one node: we choose to reinstate that node which yields the most change.
Consider in particular the case of the upper bounds (the lower bounds are analogous). Each transformation introduces an upper bound on the corresponding conditional probability. Thus the likelihood of observing the (positive) findings is also upper bounded by its variational counterpart:
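schematically, writing P_U(f_i^+ \mid d) for the upper-bound transformed conditional of the i-th positive finding (its variational parameter suppressed), P_U(f^+) for the resulting bound on the likelihood, and suppressing any negative findings,

P(f^+) \;=\; \sum_d \Big[\prod_i P(f_i^+ \mid d)\Big] P(d) \;\le\; \sum_d \Big[\prod_i P_U(f_i^+ \mid d)\Big] P(d) \;\equiv\; P_U(f^+).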
We can assess the accuracy of each variational transformation after introducing and optimizing the variational transformations for all the positive findings. Separately for each positive finding we replace the variationally transformed conditional probability with the corresponding exact conditional and compute the difference between the resulting bounds on the likelihood of the observations:
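schematically, with P_U(f^+) denoting the fully transformed upper bound above and P_U^{-i}(f^+) the same expression with the i-th conditional kept exact (notation ours),

\Delta_i \;=\; P_U(f^+) \;-\; P_U^{-i}(f^+),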
where P_U^{-i}(f^+) is computed without transforming the i-th positive finding. The larger the difference Δ_i is, the worse the i-th variational transformation is. We should therefore introduce the transformations in ascending order of the Δ_i; put another way, we should treat exactly (i.e., not transform) those conditional probabilities whose Δ measure is large.
In practice, an intelligent method for ordering the transformations is critical. Figure 2 compares the calculation of likelihoods based on the Δ measure with a method that chooses the ordering of transformations at random. The plot corresponds to a representative diagnostic case and shows the upper bounds on the log-likelihoods of the observed findings as a function of the number of conditional probabilities that were left intact (i.e., not transformed). Note that the upper bound must improve (decrease) with fewer transformations. The results are striking: the choice of ordering has a large effect on accuracy (note that the plot is on a log scale).
Figure 2: The upper bound on the log-likelihood for the delta method of removing transformations (solid line) and a method that bases the choice on a random ordering (dashed line).
Note also that the curve for the proposed ranking is convex; thus each additional replacement improves the bound by less than the previous one. This is because we first remove the worst transformations, replacing them with the exact conditionals. The remaining transformations are better, as indicated by the delta measure, and thus the bound improves less with further replacements.
We make no claims for optimality of the delta method; it is simply a useful heuristic that allows us to choose an ordering for variational transformations in a computationally efficient way. Note also that our implementation of the method optimizes the variational parameters only once at the outset and chooses the ordering of further transformations based on these fixed parameters. These parameters are suboptimal for graphs in which substantial numbers of nodes have been reinstated, but we have found in practice that this simplified algorithm still produces reasonable orderings.
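To make the ordering heuristic concrete, the following sketch computes the delta measure for a toy noisy-OR network by brute-force enumeration of disease configurations. It uses the exponential upper bound sketched above, restricts attention to positive findings, and leaves the variational parameters at fixed (untuned) values; all names and parameter values are illustrative, and a practical implementation would exploit the factorized structure of the bound rather than enumerate configurations.

import numpy as np

def conj(xi):
    """Conjugate f*(xi) of f(x) = ln(1 - exp(-x)); gives 1 - e^{-x} <= e^{xi*x - f*(xi)}."""
    return -xi * np.log(xi) + (xi + 1.0) * np.log(xi + 1.0)

def log_upper_bound(xi, theta0, theta, prior, exact=frozenset()):
    """Upper bound on log P(f+) for a toy noisy-OR network.

    Findings in `exact` keep the exact conditional 1 - exp(-x_i); all others
    use the exponential upper bound with parameter xi[i].  Disease
    configurations are enumerated exhaustively, so this is only meant for
    small illustrative problems.
    """
    n_findings, n_diseases = theta.shape
    total = -np.inf
    for code in range(2 ** n_diseases):
        d = np.array([(code >> j) & 1 for j in range(n_diseases)], dtype=float)
        x = theta0 + theta @ d                                   # noisy-OR activations
        logp = np.sum(np.where(d > 0, np.log(prior), np.log(1 - prior)))  # log P(d)
        for i in range(n_findings):
            if i in exact:
                logp += np.log1p(-np.exp(-x[i]))                 # exact: ln(1 - e^{-x_i})
            else:
                logp += xi[i] * x[i] - conj(xi[i])               # variational upper bound
        total = np.logaddexp(total, logp)
    return total

def delta_ranking(xi, theta0, theta, prior):
    """Rank findings by how much the bound drops when each one is treated exactly."""
    base = log_upper_bound(xi, theta0, theta, prior)
    # differences are taken in the log domain; the resulting ranking is the
    # same as for the raw likelihood bounds
    deltas = np.array([base - log_upper_bound(xi, theta0, theta, prior, exact={i})
                       for i in range(theta.shape[0])])
    # worst transformations (largest delta) come first; these are the ones to treat exactly
    return np.argsort(-deltas), deltas

# Toy usage: 4 positive findings, 3 diseases, arbitrary parameters.
rng = np.random.default_rng(0)
theta0 = np.full(4, 0.1)                        # leak terms
theta = rng.uniform(0.5, 2.0, size=(4, 3))      # disease-finding weights
prior = np.full(3, 0.05)                        # disease prior probabilities
xi = np.ones(4)                                 # fixed (untuned) variational parameters
order, deltas = delta_ranking(xi, theta0, theta, prior)
print(order, np.round(deltas, 4))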
Once we have decided which nodes to reinstate, the approximate inference algorithm can be run. We introduce transformations at those nodes that were left transformed by the ordering algorithm. Multiplying the exact conditional probabilities that remain in the graph by the transformed conditional probabilities yields an upper or lower bound on the joint probability associated with the graph (a product of bounds is itself a bound). Sums of bounds are again bounds, and thus the likelihood (the marginal probability of the findings) is bounded by summing the bounds on the joint probability over the disease configurations. In particular, an upper bound on the likelihood is obtained as follows:
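schematically, with T denoting the set of findings left transformed (notation as above),

P(f^+) \;\le\; \sum_d \Big[\prod_{i \in T} P_U(f_i^+ \mid d)\Big]\Big[\prod_{i \notin T} P(f_i^+ \mid d)\Big] P(d),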
and the corresponding lower bound on the likelihood is obtained similarly:
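in the same schematic notation, with P_L(f_i^+ \mid d) denoting the lower-bound transformed conditionals,

P(f^+) \;\ge\; \sum_d \Big[\prod_{i \in T} P_L(f_i^+ \mid d)\Big]\Big[\prod_{i \notin T} P(f_i^+ \mid d)\Big] P(d).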
In both cases we assume that the graph has been sufficiently simplified by the variational transformations so that the sums can be performed efficiently.
The expressions in Eq. (29) and Eq. (30) yield upper and lower bounds for arbitrary values of the variational parameters of the upper- and lower-bound transformations. We wish to obtain the tightest possible bounds, and thus we optimize these expressions with respect to those parameters: we minimize the upper bound with respect to its variational parameters and maximize the lower bound with respect to its variational parameters q. Appendix 6 discusses these optimization problems in detail. It turns out that the upper bound is convex in its variational parameters, and thus adjusting the parameters of the upper bound reduces to a convex optimization problem that can be carried out efficiently and reliably (there are no local minima). For the lower bound, it turns out that the maximization can be carried out via the EM algorithm.
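As a one-dimensional illustration of why this minimization tightens the bound (using the exponential upper bound sketched earlier, with the activation x = \theta_{i0} + \sum_j \theta_{ij} d_j held fixed),

\min_{\xi \ge 0} \; e^{\,\xi x - f^*(\xi)} \;=\; 1 - e^{-x}, \qquad \text{attained at } \xi = \frac{1}{e^x - 1};

the individually optimized bound is exact, and it is only because a single \xi_i must serve all disease configurations in the sum that the jointly optimized bound is not.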
Finally, although bounds on the likelihood are useful, our ultimate goal is to approximate the marginal posterior probabilities of the individual diseases. There are two basic approaches to utilizing the variational bounds in Eq. (29) and Eq. (30) for this purpose. The first method, which is our emphasis in the current paper, involves using the transformed probability model (based either on the upper or the lower bounds) as a computationally efficient surrogate for the original probability model. That is, we tune the variational parameters of the transformed model by requiring that the model give the tightest possible bound on the likelihood. We then use the tuned transformed model as an inference engine to provide approximations to other probabilities of interest, in particular the marginal posterior probabilities of the diseases. The approximations found in this manner are not bounds, but they are computationally efficient approximations. We provide empirical results in the following section showing that this approach indeed yields good approximations to the marginal posteriors for the QMR-DT network.
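Schematically, this first approach amounts to the approximation (in our notation, with the variational parameters tuned as described above)

P(d_j = 1 \mid f^+) \;\approx\; \frac{P_U(d_j = 1, f^+)}{P_U(f^+)},

with the analogous expression for the lower-bound model; here both quantities are computed in the (partially) transformed model, and P_U(d_j = 1, f^+) is defined below.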
A more ambitious goal is to obtain interval bounds for the marginal posterior probabilities themselves. To this end, consider the probability of the combined event that the QMR-DT model generates the observed findings and that a given disease takes a particular value. Upper and lower bounds on this joint probability follow directly from the transformations described above:
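schematically, fixing the disease d_j of interest at a particular value and summing over the configurations of the remaining diseases (notation ours),

P(d_j, f^+) \;\le\; \sum_{d \setminus d_j} \Big[\prod_{i \in T} P_U(f_i^+ \mid d)\Big]\Big[\prod_{i \notin T} P(f_i^+ \mid d)\Big] P(d) \;\equiv\; P_U(d_j, f^+),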
where each term in the sum is a product of upper-bound transformed conditional probabilities and exact (untransformed) conditionals. Analogously, we can compute a lower bound by applying the lower-bound transformations:
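in the same notation,

P(d_j, f^+) \;\ge\; \sum_{d \setminus d_j} \Big[\prod_{i \in T} P_L(f_i^+ \mid d)\Big]\Big[\prod_{i \notin T} P(f_i^+ \mid d)\Big] P(d) \;\equiv\; P_L(d_j, f^+).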
Combining these bounds we can obtain interval bounds on the posterior marginal probabilities for the diseases (cf. Draper & Hanks 1994):
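since P(d_j \mid f^+) = P(d_j, f^+) / [P(d_j, f^+) + P(\bar d_j, f^+)] and the ratio a/(a+b) increases in a and decreases in b, we have (in our notation)

\frac{P_L(d_j, f^+)}{P_L(d_j, f^+) + P_U(\bar d_j, f^+)} \;\le\; P(d_j \mid f^+) \;\le\; \frac{P_U(d_j, f^+)}{P_U(d_j, f^+) + P_L(\bar d_j, f^+)},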
where \bar d_j denotes the binary complement of d_j.