Based on the theoretical considerations of
Section 2.1, we know that the crucial element of the
algorithm is converging on a good approximation of the optimal
importance function. In what follows, we first give the optimal
importance function for calculating
Pr(E = e) and then discuss how to use the structural advantages of Bayesian
networks to approximate this function. In the sequel, we will use
the symbol Pr(X\E) to denote the importance sampling function and
ρ(X\E) to denote the optimal importance sampling function.
Since Pr(X\E, E = e) > 0, from Corollary 1 we have

    ρ(X\E) = Pr(X\E | E = e) .
Although we know the mathematical expression for the optimal importance sampling function, it is difficult to obtain this function exactly. In our algorithm, we use the following importance sampling function:

    Pr(X\E) = ∏_{X_i ∈ X\E} Pr(X_i | Pa(X_i), E = e) .
This function partially accounts for the effect of all the evidence on every node during the sampling process. When the structure of the network that has absorbed the evidence is the same as the original network structure, this function is the optimal importance sampling function. It is easy to learn and, as our experimental results show, it is a good approximation to the optimal importance sampling function. Theoretically, when the posterior structure of the model changes drastically as the result of observed evidence, this importance sampling function may perform poorly. We have tried to find practical networks in which this would happen, but to date have not encountered a drastic example of this effect.
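Sampling from this factored importance function proceeds like forward sampling, except that each non-evidence node is drawn from its ICPT rather than its CPT. The sketch below illustrates this under assumed data structures (the `order`, `parents`, and `icpt` representations are illustrative, not the paper's):

```python
import random

def sample_from_icpt(order, parents, icpt, evidence):
    """Draw one sample of the non-evidence nodes X\\E in topological order.

    order    -- topological ordering of all nodes (assumed representation)
    parents  -- dict: node -> tuple of parent nodes
    icpt     -- dict: (node, parent_config) -> {value: probability},
                an (approximate) table for Pr(X_i | Pa(X_i), e)
    evidence -- dict: evidence node -> observed value
    """
    sample = dict(evidence)          # evidence nodes keep their observed values
    for node in order:
        if node in evidence:
            continue
        config = tuple(sample[p] for p in parents[node])
        dist = icpt[(node, config)]  # ICPT column for this parent configuration
        values, probs = zip(*dist.items())
        sample[node] = random.choices(values, weights=probs)[0]
    return sample
```

Because parents precede children in the topological order, every parent configuration is already instantiated when a node is sampled.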
From Section 2.2, we know that the score sums
corresponding to {x_i, pa(X_i), e} can yield an unbiased
estimator of Pr(x_i, pa(X_i), e). According to the
definition of conditional probability, we can then obtain an estimator of
Pr(x_i | pa(X_i), e). This can be achieved by
maintaining an updating table for every node, the structure of
which mimics the structure of the CPT. Such tables allow us to
decompose the above importance function into components that can
be learned individually. We will call these tables the importance conditional probability tables (ICPT).
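The accumulation of score sums and the subsequent normalization can be sketched as follows, assuming samples arrive as weighted assignments (the function name and dictionary-based table layout are illustrative assumptions):

```python
from collections import defaultdict

def estimate_icpt(samples, node, parent_nodes):
    """Estimate Pr(x_i | pa(X_i), e) from weighted samples.

    samples -- list of (assignment_dict, weight) pairs drawn under the
               current importance function.
    Accumulating the weights for each (pa(X_i), x_i) cell gives the score
    sums that estimate Pr(x_i, pa(X_i), e); dividing each cell by the sum
    over x_i for the same parent configuration yields the conditional.
    """
    joint = defaultdict(float)   # (parent_config, value) -> score sum
    marg = defaultdict(float)    # parent_config -> score sum
    for assignment, weight in samples:
        config = tuple(assignment[p] for p in parent_nodes)
        joint[(config, assignment[node])] += weight
        marg[config] += weight
    return {(config, value): score / marg[config]
            for (config, value), score in joint.items()}
```

The table has one row per parent configuration, mirroring the CPT, which is what allows each component of the importance function to be learned independently.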
Proof: Suppose we have set the values of all the parents of node
X_i to pa(X_i). Node X_i is dependent on the evidence E = e given
pa(X_i) only when X_i is d-connected with E given
pa(X_i) [Pearl 1988]. According to the definition
of d-connection, this happens only when there exists a member
of X_i's descendants that belongs to the set of evidence nodes
E. In other words, X_i ∈ Anc(E).
Theorem 2 is very important for the AIS-BN algorithm. It states, essentially, that the ICPT tables of those nodes that are not ancestors of the evidence nodes are equal to the CPT tables throughout the learning process. We only need to learn the ICPT tables for the ancestors of the evidence nodes, which can often lead to significant savings in computation. If, for example, all evidence nodes are root nodes, the ICPT tables of every node are already available and the AIS-BN algorithm becomes identical to the likelihood weighting algorithm. Without evidence, the AIS-BN algorithm becomes identical to the probabilistic logic sampling algorithm.
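Deciding which nodes need ICPT learning thus reduces to computing Anc(E), a simple graph reachability problem. A minimal sketch, assuming the network is given as a parent map (the representation is an illustrative assumption):

```python
def ancestors_of_evidence(parents, evidence_nodes):
    """Return Anc(E): all nodes from which some evidence node is
    reachable along directed edges. By Theorem 2, nodes outside this
    set keep their CPTs as ICPTs, so no learning is needed for them.

    parents -- dict: node -> iterable of parent nodes
    """
    anc = set()
    stack = list(evidence_nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in anc:
                anc.add(p)
                stack.append(p)
    return anc
```

A single backward traversal from the evidence nodes suffices, so the test costs only linear time in the size of the network.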
It is worth pointing out that for some X_i,
Pr(X_i | Pa(X_i), e) (i.e., the ICPT table for
X_i) can be easily calculated using exact methods. For example,
when X_i is the only parent of an evidence node E_j and E_j
is the only child of X_i, the posterior probability distribution
of X_i is straightforward to compute exactly. Since the focus of
the current paper is on sampling, the test results reported in
this paper do not include this improvement of the AIS-BN
algorithm.
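In the single-parent, single-child case just mentioned, the ICPT entry follows directly from Bayes' rule: Pr(x_i | pa(X_i), e_j) ∝ Pr(x_i | pa(X_i)) Pr(e_j | x_i), since given pa(X_i) the node is d-separated from all other evidence. A sketch of this computation, with the dictionary-based distribution representation as an illustrative assumption:

```python
def exact_icpt_entry(prior, likelihood):
    """Posterior over X_i when evidence node E_j is X_i's only child
    and X_i is E_j's only parent.

    prior      -- dict: x -> Pr(x | pa(X_i)), one column of X_i's CPT
    likelihood -- dict: x -> Pr(E_j = e_j | X_i = x), from E_j's CPT
    """
    unnorm = {x: prior[x] * likelihood[x] for x in prior}
    z = sum(unnorm.values())  # normalizing constant Pr(e_j | pa(X_i))
    return {x: p / z for x, p in unnorm.items()}
```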
Figure 3 lists an algorithm that implements Step 7
of the basic AIS-BN algorithm listed in Figure 2. When
we estimate Pr(x_i | pa(X_i), e), we use only
the samples obtained at the current stage. One reason for this is
that the information obtained in previous stages has already been absorbed
by Pr^k(X\E). The other reason is that,
in principle, each successive iteration is more accurate than the
previous one and its importance function is closer to the optimal
importance function. Thus, the samples generated by
Pr^{k+1}(X\E) are better than those
generated by Pr^k(X\E).
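The transition from Pr^k to Pr^{k+1} blends the current-stage estimate into the existing ICPT under a learning rate, as discussed below. A sketch of one such update for a single table, where the update form Pr^{k+1} = Pr^k + η(k)(Pr' − Pr^k) follows the description in the text and the function name and table representation are illustrative assumptions:

```python
def update_icpt(icpt_k, estimate, eta):
    """One AIS-BN learning step for a single node's ICPT:
      Pr^{k+1}(x|pa,e) = Pr^k(x|pa,e) + eta * (Pr'(x|pa,e) - Pr^k(x|pa,e)),
    where Pr' is the estimate from the current stage's samples only.

    icpt_k, estimate -- dicts: (parent_config, value) -> probability
    eta              -- learning rate in [0, 1]; eta = 0 keeps the old
                       table, eta = 1 discards it for the new estimate
    """
    return {key: old + eta * (estimate.get(key, old) - old)
            for key, old in icpt_k.items()}
```

Cells with no current-stage samples are left unchanged here, one simple choice among several possible treatments.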
The difference Pr'(X_i | pa(X_i), e) − Pr^k(X_i | pa(X_i), e), where
Pr'(X_i | pa(X_i), e) denotes the estimate obtained from the current-stage
samples, corresponds to the vector of first partial derivatives in the
direction of the maximum decrease in the error.
η(k) is a positive function that determines the learning rate. When
η(k) = 0 (lower bound), we do not update our importance
function. When η(k) = 1 (upper bound), at each stage we discard
the old function. The convergence speed is directly related to
η(k). If it is small, the convergence will be very slow due
to the large number of updating steps needed to reach a local
minimum. On the other hand, if it is large, the convergence rate will
initially be very fast, but the algorithm will eventually start to
oscillate and thus may not reach a minimum. There are many papers
in the field of neural network learning that discuss how to choose
the learning rate so that the estimated importance function converges
quickly to the destination function. Any method that can improve the
learning rate should be applicable to this algorithm. Currently,
we use the following function proposed by Ritter et al.
[1991]