Key concepts:
The best text to learn about Bayes Nets is probably Judea Pearl's original paper on the topic.
Bayes networks, also called Belief nets, are a graphical way of representing local dependencies among a set of variables.
It is best illustrated by an example.
Caption: Bayes network showing dependencies between variables related to diagnosis of problems in a car.
More later.
Given a network with a set of nodes $(X_1, X_2, \cdots, X_K)$, the joint probability of all the variables in the network is given by \[ P(X_1, X_2, \cdots, X_K) = \prod_i P(X_i | parents(X_i)) \]
The marginal probability of any subset $(X_1, X_2, \cdots, X_N)$ of variables can now be computed by marginalizing out the remaining variables: \[ P(X_1, \cdots, X_N) = \sum_{X_{N+1}, \cdots, X_K} \prod_{i=1}^K P(X_i | parents(X_i)) \]
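As a concrete sketch of these two formulas, consider a toy three-node network $A \rightarrow C \leftarrow B$ with invented CPTs (a hypothetical example, not the car network above; the names joint and P_C1_given_AB and all the numbers are made up for illustration). The joint is the product of each node's conditional probability given its parents, and a marginal is obtained by summing the joint over the remaining variables.

import itertools

# Toy network A -> C <- B with made-up CPTs (illustration only).
P_A = {0: 0.3, 1: 0.7}
P_B = {0: 0.6, 1: 0.4}
P_C1_given_AB = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A=a, B=b)

def joint(a, b, c):
    # P(A=a, B=b, C=c) = P(A=a) * P(B=b) * P(C=c | A=a, B=b)
    pc1 = P_C1_given_AB[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1.0 - pc1)

# Marginal P(C=1): sum the joint over the remaining variables A and B.
p_c1 = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
print(p_c1)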
Bayes nets may be used to compute the a posteriori probability distributions of variables (represented by nodes), given evidence.
“Evidence”, in this context, generally refers to the values of other variables that may be observed.
In the car example above for instance, we may be given that the lights function, the fuel gauge shows a full tank, and the battery was recently changed, but the car does not start. These are four evidentiary values, specifying the values of the variables $L$ ($L=1$), $G$ ($G=1$), $A$ ($A = 0$), and $E$ ($E=0$).
We must determine the probability that the spark plug is OK, i.e. that $K=1$, given all this evidence. That is, we would like to compute the probability $P_K(1 | L=1, G=1, A=0, E=0)$, or more generally, the distribution $P_K(k | L=1, G=1, A=0, E=0)$ over all possible values $k$ of $K$.
We can do so as follows: \[ P_K(k | L=1, G=1, A=0, E=0) = \frac{P_{K,L,G,A,E}(k, 1, 1,0,0)}{P_{L,G,A,E}( 1, 1,0,0)} \]
The above term cannot, however, be directly computed, since what we actually have from the Bayes network is the joint probability of all variables $(A,B,L,S,T,K,E,M,N,F,Y,G)$. In order to obtain the marginal probabilities $P_{K,L,G,A,E}(k, 1, 1,0,0)$ and $P_{L,G,A,E}(1, 1,0,0)$, we must marginalize out the other variables. To do so, we must compute \begin{align} P_K(k | L=1, G=1, A=0, E=0) &= \frac{\sum_b\sum_s\sum_t\sum_m\sum_n\sum_f\sum_y P_{A,B,L,S,T,K,E,M,N,F,Y,G}(0,b,1,s,t,k,0,m,n,f,y,1)}{\sum_b\sum_s\sum_t\sum_m\sum_{k'}\sum_n\sum_f\sum_y P_{A,B,L,S,T,K,E,M,N,F,Y,G}(0,b,1,s,t,k',0,m,n,f,y,1)} \\ &= \frac{\sum_b\sum_s\sum_t\sum_m\sum_n\sum_f\sum_y P_A(0)P_L(1)P_{B|A,L}(b|0,1)P_S(s)P_{T|B,S}(t|b,s)P_M(m)P_N(n)P_F(f)P_{Y|M,N,F}(y|m,n,f)P_{G|F}(1|f)P_K(k)P_{E|K,Y}(0|k,y)}{\sum_b\sum_s\sum_t\sum_{k'}\sum_m\sum_n\sum_f\sum_y P_A(0)P_L(1)P_{B|A,L}(b|0,1)P_S(s)P_{T|B,S}(t|b,s)P_M(m)P_N(n)P_F(f)P_{Y|M,N,F}(y|m,n,f)P_{G|F}(1|f)P_K(k')P_{E|K,Y}(0|k',y)} \end{align}
More generally, given a Bayes net with a set of nodes ${\mathcal Y}$, and given the values at a set $E$ evidence nodes, the a posteriori probability distribution of a variable $X$ can be computed as \[ P_X(x|E) = \frac{\sum_{Y \in {\mathcal Y}\setminus (E \bigcup X)}P_X(x|parents(X)) \prod_{Z \in E} P(Z | parents(Z))\prod_{Y \in {\mathcal Y}\setminus (E \bigcup X)}P(Y | parents(Y))}{\sum_{Y \in {\mathcal Y}\setminus E }\prod_{Z \in E} P(Z | parents(Z)) \prod_{Y\in{\mathcal Y}\setminus E}P(Y | parents(Y))} \]
The ugly-looking equation above is not as ugly as it looks. $P(Z|parents(Z))$ refers to the probability of $Z$ given the values of the variables at the immediate parents of $Z$ in the network. ${\mathcal Y}\setminus E$ refers to all nodes in the network excluding the evidence nodes for which values have been specified. ${\mathcal Y}\setminus (E \bigcup X)$ refers to all nodes in the network excluding both the evidence nodes and the node $X$ for which the a posteriori probability distribution is being computed. The summation in the numerator is simply marginalizing out all nodes other than $X$ for which we have no evidence. The summation in the denominator is marginalizing out all nodes (including $X$) for which we have no evidence.
The important thing to note is that computing the a posteriori probability distribution of $X$ as given above requires a series of nested summations over all marginalized non-evidence nodes. For even medium-sized networks, this can become a tremendously expensive operation (and is exponential in the size of the network).
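To make the cost concrete, here is a sketch of this computation by direct enumeration. The data structures (parents, CPT), the toy chain $A \rightarrow B \rightarrow C$, the binary values, and all the numbers are assumptions made for illustration; the nested loop over every assignment of the non-evidence variables is exactly the set of nested summations described above.

from itertools import product

def posterior_by_enumeration(nodes, parents, CPT, query, evidence):
    # P(query = q | evidence), obtained by summing the full joint over every
    # assignment consistent with the evidence; this is exponential in the
    # number of non-evidence nodes.
    hidden = [X for X in nodes if X not in evidence and X != query]
    scores = {}
    for q in (0, 1):
        total = 0.0
        for combo in product((0, 1), repeat=len(hidden)):
            assign = dict(zip(hidden, combo))
            assign.update(evidence)
            assign[query] = q
            p = 1.0
            for X in nodes:
                pa = tuple(assign[Y] for Y in parents[X])
                p *= CPT[X][pa][assign[X]]
            total += p
        scores[q] = total
    norm = sum(scores.values())
    return {q: s / norm for q, s in scores.items()}

# Toy chain A -> B -> C with invented CPTs, evidence C = 1, query B.
parents = {"A": [], "B": ["A"], "C": ["B"]}
CPT = {
    "A": {(): {0: 0.4, 1: 0.6}},
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "C": {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
}
print(posterior_by_enumeration(["A", "B", "C"], parents, CPT, "B", {"C": 1}))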
We can sometimes compute the marginal distribution of specific variables or groups of variables much more efficiently by rearranging the summations. This technique is called variable elimination.
The general idea is to arrange the nodes in topological sort order and push each summation as far inward as possible, so that the innermost summations are evaluated first.
Let $X_1, \cdots, X_N$ be the nodes in topologically sorted order. We would like to compute the marginal probability of $X_K$ taking a value $x_K$.
We can write \begin{align} P_{X_K}(x_K | E) &= \sum_{x_1} \cdots \sum_{x_{K-1}} \sum_{x_{K+1}}\cdots \sum_{x_N} P_{X_1}(x_1 |E) \cdots P_{X_{K-1}}(x_{K-1} | parents(X_{K-1}),E)\,P_{X_{K}}(x_{K} | parents(X_{K}),E)\,P_{X_{K+1}}(x_{K+1} | parents(X_{K+1}),E)\cdots P_{X_N}(x_N | parents(X_N),E) \end{align} \begin{align} &= \sum_{x_1} P_{X_1}(x_1|E) \sum_{x_2} P_{X_2}(x_2 | parents(X_2),E) \cdots \sum_{x_{K-1}}P_{X_{K-1}}(x_{K-1} | parents(X_{K-1}),E)\,P_{X_{K}}(x_{K} | parents(X_{K}),E)\sum_{x_{K+1}}P_{X_{K+1}}(x_{K+1} | parents(X_{K+1}),E) \cdots \sum_{x_N} P_{X_N}(x_N | parents(X_N),E) \end{align}
While not immediately obvious from the above equation, the second version greatly reduces the number of summations. For instance, $\sum_{x_N}P_{X_N}(x_N | parents(X_N),E)$ only requires a summation over all values of $X_N$ for each combination of values of the parents of $X_N$.
Consider the fuel gauge $G$ in our car network. Assume that in the topological sort, $G$ ends up at the end. In the above form of the solution, we only need to compute $\sum_g P_G(g | F = 0, E)$ and $\sum_g P_G(g | F=1, E)$, for a total of four summations. These values are computed once, and there is no further summation over $G$. Even more importantly, since $G$ depends only on $F$, if $G$ is not itself an evidence node (i.e. its value has not been specified), then each of the above summations is simply 1. The variable $G$ is eliminated from consideration.
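The following sketch illustrates the rearrangement on a hypothetical four-node chain $X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow X_4$ with invented binary CPTs (none of this is the car network). The naive marginal of $X_2$ enumerates all combinations of $X_1$, $X_3$ and $X_4$; in the rearranged version the innermost sums over $X_4$ and $X_3$ each collapse to 1 because those variables are unobserved, so they are eliminated.

# Chain X1 -> X2 -> X3 -> X4 with invented CPTs (illustration only).
P1 = [0.4, 0.6]                       # P1[x1]     = P(X1=x1)
P2 = [[0.7, 0.3], [0.2, 0.8]]         # P2[x1][x2] = P(X2=x2 | X1=x1)
P3 = [[0.5, 0.5], [0.1, 0.9]]         # P3[x2][x3] = P(X3=x3 | X2=x2)
P4 = [[0.6, 0.4], [0.3, 0.7]]         # P4[x3][x4] = P(X4=x4 | X3=x3)

# Naive marginal of X2: enumerate every combination of X1, X3 and X4.
naive = [sum(P1[x1] * P2[x1][x2] * P3[x2][x3] * P4[x3][x4]
             for x1 in (0, 1) for x3 in (0, 1) for x4 in (0, 1))
         for x2 in (0, 1)]

# Rearranged: sum out X4 first, then X3.  Each inner sum is 1 because the
# variable is unobserved, so X4 and X3 are eliminated; only X1 remains.
s4 = [sum(P4[x3][x4] for x4 in (0, 1)) for x3 in (0, 1)]            # [1.0, 1.0]
s3 = [sum(P3[x2][x3] * s4[x3] for x3 in (0, 1)) for x2 in (0, 1)]   # [1.0, 1.0]
rearranged = [s3[x2] * sum(P1[x1] * P2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]

print(naive)
print(rearranged)   # same values as `naive`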
Variable elimination is an effective mechanism for computing the marginal probability distribution of a single node. But if we want to compute the marginals for all the nodes in the network at once efficiently, we require belief propagation.
A much faster way to compute a posteriori probabilities of variables is to consider the fact that each variable in the net actually only connects to a small number of other variables, namely its immediate parents, and its immediate children.
Consider any network that is a directed acyclic graph (DAG). We wish to compute the a posteriori probability of a variable $X$, given some evidence $E$. Consider the following simple subgraph (where we have only shown $X$, its parents and its children).
Caption: A node $X$ in a network, its parents, and children.
We aim to compute the a posteriori probability $P_X(x|E)$.
To do so, we can compute the joint probability distribution of $X$, its parents, and its children, and then marginalize the parents and children out. \[ P_X(x|E) = \sum_{y_1,\cdots y_J}\sum_{z_1,\cdots,z_K} P_{Y_1,\cdots,Y_J,X,Z_1,\cdots,Z_K}(y_1, \cdots, y_J, x, z_1, \cdots, z_K | E) \]
where the notation $\sum_{y_1,\cdots y_J}$ represents a summation over all combinations of values taken by $Y_1, \cdots, Y_J$, and $\sum_{z_1,\cdots z_K}$ represents a summation over all combinations of values taken by $Z_1, \cdots, Z_K$.
The reason for rewriting it in the above manner may not be immediately apparent, but as we see below, this enables us to eliminate the direct dependence of $X$ on the evidence $E$.
Using the dependencies depicted in the network, we can write \begin{align} P_{Y_1,\cdots,Y_J,X,Z_1,\cdots,Z_K}(y_1, \cdots, y_J, x, z_1, \cdots, z_K | E) &= P_{Y_1}(y_1|E)\cdots P_{Y_J}(y_J|E) P_{X|Y_1,\cdots,Y_J}(x | y_1,\cdots,y_J) P_{Z_1|X}(z_1|x,E) \cdots P_{Z_K|X}(z_K|x,E) \\ &= \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X|Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \end{align}
Combining it with the marginalization given above, we get \begin{align} P_X(x|E) &= \sum_{y_1,\cdots y_J} \sum_{z_1,\cdots,z_K} \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \\ &= \sum_{y_1,\cdots y_J} \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \sum_{z_1,\cdots,z_K} \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \\ &= \sum_{y_1,\cdots y_J} P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{j=1}^J P_{Y_j}(y_j | E) \prod_{k=1}^K\sum_{z_k}P_{Z_k|X}(z_k | x, E) \end{align}
Grouping the terms, we can hence write \[ P_X(x|E) = \bigg(\sum_{y_1,\cdots y_J} P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{j=1}^J P_{Y_j}(y_j | E) \bigg)\bigg(\prod_{k=1}^K\sum_{z_k}P_{Z_k|X}(z_k | x, E)\bigg) \]
or more generally, with some minor abuse of notation \begin{equation}\label{1}\tag{1} P_X(x | E) = \Bigg(\sum_{V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}} P_{X|{\mathcal P}(X)}(x|V_Y) \prod_{Y \in {\mathcal P}(X)} P_Y(v_Y|E) \Bigg) \Bigg(\sum_{V_Z \in \{(v_Z: Z \in {\mathcal C}(X))\}} \prod_{Z \in {\mathcal C}(X)} P_{Z|X}(v_Z|x,E) \Bigg) \end{equation}
Here ${\mathcal P}(X)$ represents all parents of $X$, $v_Y$ represents a value taken by a parent $Y$, $(v_Y: Y \in {\mathcal P}(X))$ represents a combination of values taken by all parents of $X$, and $V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}$ represents the complete set of all such combinations.
Similarly, ${\mathcal C}(X)$ represents all children of $X$, $v_Z$ represents a value taken by a child $Z$, $(v_Z: Z \in {\mathcal C}(X))$ represents a combination of values taken by all children of $X$, and $V_Z \in \{(v_Z: Z \in {\mathcal C}(X))\}$ represents the complete set of all such combinations.
The a posteriori probability of any value of $X$ can thus be factored into two terms, shown in parentheses in the above equation. The first represents belief in the value obtained from the parents of $X$. The second is belief derived from the children.
Thus the a posteriori probability of $X$ is a product of beliefs inherited from its parents and its children. To complete the solution, however, we need the beliefs of the parents and children, which in turn can be similarly computed. This leads to an iterative procedure called Belief Propagation.
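Before describing the procedure, here is a small numeric check of the factorization above, on a hypothetical chain $Y \rightarrow X \rightarrow Z$ with invented CPTs and evidence $Z=1$ (the names PY, PX_given_Y, PZ_given_X and all numbers are made up for illustration). The parent term and the child term are computed separately, their product is normalized, and the result matches direct enumeration.

# Chain Y -> X -> Z, binary variables, evidence Z = 1 (illustration only).
PY = [0.3, 0.7]                           # PY[y]          = P(Y=y)
PX_given_Y = [[0.8, 0.2], [0.4, 0.6]]     # PX_given_Y[y][x] = P(X=x | Y=y)
PZ_given_X = [[0.9, 0.1], [0.25, 0.75]]   # PZ_given_X[x][z] = P(Z=z | X=x)

# Parent term: sum over y of P(X=x | Y=y) P(Y=y).  Child term: P(Z=1 | X=x).
parent_term = [sum(PY[y] * PX_given_Y[y][x] for y in (0, 1)) for x in (0, 1)]
child_term = [PZ_given_X[x][1] for x in (0, 1)]

unnorm = [parent_term[x] * child_term[x] for x in (0, 1)]
posterior = [u / sum(unnorm) for u in unnorm]

# Brute-force check by enumerating Y directly.
brute = [sum(PY[y] * PX_given_Y[y][x] * PZ_given_X[x][1] for y in (0, 1))
         for x in (0, 1)]
norm = sum(brute)
brute = [b / norm for b in brute]

print(posterior)
print(brute)    # identical to `posterior`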
The figure below illustrates the BP procedure. Each node $X$ in the graph has a “forward” belief $F_X(x)$, which indicates the belief, obtained from its parents, that $X$ will take the value $x$. Each node also has a “backward” belief $B_X(x)$, which is the belief that $X$ will take value $x$ based on its effect on its children.
Caption: The figure shows the forward belief messages passed to a node $X$ by its parents, and the backward belief messages obtained from its children. The figure also shows all the nodes that contribute to these messages. Downward arrows represent forward messages. Upward arrows represent backward messages. Blue arrows represent all messages that finally contribute to the forward beliefs at $X$. Red arrows represent all messages that contribute to the backward beliefs at $X$. The forward beliefs $F_{Y_1}(y_1)$ and $F_{Y_2}(y_2)$ at parent nodes $Y_1$ and $Y_2$ also contribute forward messages $F_{Y_i,X}(y_i)$ into $X$. These forward messages are used to compose the forward belief $F_X(x)$. Similarly, the backward beliefs $B_{Z_1}(z_1)$ and $B_{Z_2}(z_2)$ at child nodes $Z_1$ and $Z_2$ also contribute to the backward messages $B_{Z_i,X}(x)$ into $X$, which are used to compose backward belief $B_X(x)$.
The forward belief $F_X(x)$ is derived from forward belief messages $F_{Y,X}(y)$ from each parent $Y$ of $X$, which are in turn functions of the forward beliefs of the parents.
The backward belief $B_X(x)$ is derived from backward belief messages $B_{Z,X}(x)$ from the children of $X$. The messages are in turn functions of the backward beliefs of the children.
The complete set of equations is as follows.
Node Beliefs:
\[ \label{fwdbelief}\tag{2} F_X(x) = \sum_{V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}} P_{X|{\mathcal P}(X)}(x|V_Y) \prod_{Y \in {\mathcal P}(X)} F_{Y,X}(v_Y) \]
\[ \label{bwdbelief}\tag{3} B_X(x) = \prod_{Z \in {\mathcal C}(X)} B_{Z,X}(x) \]
Here, as before, ${\mathcal P}(X)$ refers to the parents of $X$, and ${\mathcal C}(X)$ refers to its children. $v_Y$ refers to a value of $Y$, and $\{(v_Y: Y \in {\mathcal P}(X))\}$ refers to the set of all combinations of values of all of the parents of $X$.
Note that the forward and backward belief equations are exactly analogous to the first and second parenthesized terms in Equation $\eqref{1}$ above.
Note that the forward messages $F_{Y,X}(v_Y)$ from the parents refer to the values $v_Y$ taken by the parent nodes, but not to the value $x$ taken by $X$. On the other hand, the backward messages $B_{Z,X}$ from the children depend only on the value of $X$, but do not refer to the actual values taken by its children. This is also analogous to the terms in Equation $\eqref{1}$, and reflects the fact that the direction of information flow follows the direction of the arrows in the graph.
Special Cases:
If $X$ is a source node (it has no parents), its forward belief is simply its prior probability: $F_X(x) = P_X(x)$.
If $X$ is a sink node (it has no children), its backward belief is uninformative: $B_X(x) = 1$ for all $x$.
If $X$ is an evidence node with observed value $\hat{x}$, its beliefs are clamped: $F_X(x) = B_X(x) = 1$ if $x = \hat{x}$, and $0$ otherwise.
Messages:
\[ \label{fwdmsg}\tag{4} F_{Y,X}(v_Y) = F_Y(v_Y) \prod_{Z \in ({\mathcal C}(Y) \setminus X)} B_{Z,Y}(v_Y) \]
\[ \label{bwdmsg}\tag{5} B_{Z,X}(x) = \sum_{z} B_Z(z) \sum_{V_Y \in \{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}} P_{Z|X,({\mathcal P}(Z)\setminus X)}(z|x,V_Y) \prod_{Y \in ({\mathcal P}(Z)\setminus X)} F_{Y,Z}(v_Y) \]
Here $({\mathcal C}(Y) \setminus X)$ refers to the set of all children of $Y$ excluding $X$. Similarly, $({\mathcal P}(Z)\setminus X)$ is the set of all parents of $Z$ excluding $X$.
In Equation $\eqref{bwdmsg}$ $\{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}$ refers to the set of all combinations of values of all parents of $Z$ excluding $X$. $\sum_{V_Y \in \{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}}$ refers to a summation over all elements of this set. The notation $P_{Z|X,({\mathcal P}(Z)\setminus X)} (z|x,V_Y)$ refers to the conditional probability of $Z$ taking the value $z$, given that $X$ takes the value $x$ and the remaining parents of $Z$ take the values $V_Y$.
Note that the forward message in Equation $\eqref{fwdmsg}$ is simply the product of the forward belief at $Y$ and the backward messages into $Y$ from all of its children other than $X$ (i.e. the backward belief at $Y$ with $X$'s own contribution removed). Similarly, the backward message in Equation $\eqref{bwdmsg}$ is the product of the backward belief at $Z$ and the forward belief at $Z$ computed with $X$ fixed to $x$, summed over all values of $Z$. The import of this becomes obvious below.
The a posteriori probability distribution of $X$ now simply becomes \[\label{posterior}\tag{6} P_X(x|E) = \frac{F_X(x)B_X(x)}{\sum_{\hat{x}}F_X(\hat{x})B_X(\hat{x})} \]
Given: Network and a set of evidence nodes ${\mathcal E}$ whose values have been specified.
Initialization: Set all beliefs and messages to 1. Set the forward beliefs of source nodes to their prior probabilities. For each evidence node, set its forward and backward beliefs to 1 at the observed value and to 0 at all other values.
Arrange (Optional): Topologically sort the nodes (and obtain the reverse order), so that each pass sweeps parents before children or vice versa.
Iterate to convergence: Alternate forward passes over the nodes in topological order, which update forward beliefs and the forward messages to children, and backward passes in reverse topological order, which update backward beliefs and the backward messages to parents. One pass in each direction suffices for trees.
Compute Posterior Probability: For each non-evidence node, compute the posterior from its forward and backward beliefs using Equation $\eqref{posterior}$.
The following “pseudo-code” may clarify. Note the distinction between $[X]$ and $(x)$, both of which are used for indexing: $[X]$ is used to index w.r.t. nodes, and $(x)$ to index w.r.t. node values.
# Notation:
# NET = The complete network specification, including nodes and edges
#
# V = set of all nodes in net
# SOURCE = set of source nodes in V
# SINK = set of sink nodes in V
#
# Vsort = All nodes in topological sort order
# Vrevsort = All nodes in reverse topological sort order
#
# E = set of all nodes for which a value has been specified as evidence
# For each node X in E, v(X) = Specified evidence value for X
#
# X = individual node in net
# par(X) = set of parent nodes for node X
#          Assuming par(X) = empty set for source nodes
# child(X) = set of child nodes for node X
#            Assuming child(X) = empty set for sink nodes
#
# val(X) = set of values that X can take
# N(X) = Number of distinct values X can take (i.e. size(val(X)))
#
# P[X](x | val(par(X))) = probability table for all non-source nodes X
# P[X](x) = probability table for source nodes X
#
# F[X](x) = Forward belief for node X and value x
# B[X](x) = Backward belief for node X and value x
# FM[X][Z](x) = Forward message from node X at value x to child node Z
# BM[X][Y](y) = Backward message from node X to parent node Y at value y

Initialize:
    # Initially set everything to 1: All values have equal belief.
    # Note the difference in how forward and backward beliefs are initialized.
    for X in V:
        for v in val(X): F[X](v) = 1, B[X](v) = 1
        for Z in child(X), for v in val(X): FM[X][Z](v) = 1
        for Y in par(X), for v in val(Y): BM[X][Y](v) = 1
    endfor

    # Initialize forward belief for source nodes
    for X in SOURCE, for v in val(X): F[X](v) = P[X](v)

    # Initialize backward belief for sink nodes (redundant)
    for X in SINK, for v in val(X): B[X](v) = 1

    # Initialize beliefs for evidence nodes.
    # Do this last, so that evidence values override the source-node priors.
    # Remember that the default values were initialized to 1.
    # v(X) is the *observed* evidence value of X.
    for X in E:
        for x in val(X):
            if x == v(X):
                F[X](x) = B[X](x) = 1
            else:
                F[X](x) = B[X](x) = 0
            endif
            # If the forward pass is done first, this should not be necessary
            for Z in child(X):
                if x == v(X):
                    FM[X][Z](x) = 1
                else:
                    FM[X][Z](x) = 0
                endif
            endfor
        endfor
    endfor

Sort:
    # You may use any algorithm for topological sorting.
    # Kahn's algorithm is simple to implement.
    Vsort = TopologicalSort(V, NET)
    # The reverse topological sort could just be the reverse of Vsort
    Vrevsort = ReverseTopologicalSort(V, NET)

# An ugly little helper function to compute forward beliefs recursively.
# We need recursion because each node may have a different number of parents.
# If we wanted to be efficient, we would flatten this computation into
# nested loops at every node.
# "FixedNodes" is an array of nodes for which a value is externally fixed.
# We don't need it to compute forward beliefs, but need it for backward messages.
getfwdbelief(FM, Pset, FixedNodes, inPvalueSet, X, x):
    Y = pop Pset
    FB = 0
    for v in val(Y):
        if Y in FixedNodes and FixedNodes[Y] != v: continue
        # A fixed parent contributes no forward message: the backward message
        # to a parent must not include that parent's own forward message.
        if Y in FixedNodes:
            msg = 1
        else:
            msg = FM[Y][X](v)
        endif
        PvalueSet = [inPvalueSet v]
        if Pset is empty:
            FB += msg * P[X](x | PvalueSet)
        else:
            FB += msg * getfwdbelief(FM, Pset, FixedNodes, PvalueSet, X, x)
        endif
    endfor
    return FB

ForwardPass:
    for X in Vsort:
        # Forward beliefs of source nodes and evidence nodes are not modified
        if X not in SOURCE and X not in E:
            for x in val(X):
                F[X](x) = getfwdbelief(FM, par(X), [], [], X, x)
            endfor
        endif
        # Compute forward messages to children
        for Z in child(X):
            for x in val(X):
                FM[X][Z](x) = F[X](x) * (B[X](x) / BM[Z][X](x))
            endfor
        endfor
    endfor

BackwardPass:
    for X in Vrevsort:
        # Backward beliefs of sink nodes and evidence nodes are not modified
        if X not in SINK and X not in E:
            for x in val(X):
                B[X](x) = 1
                for C in child(X):
                    B[X](x) *= BM[C][X](x)
                endfor
            endfor
        endif
        # Backward messages to parents
        for Y in par(X):
            # Make sure to clear FixY first, so that it has only ONE entry
            FixY = []
            for y in val(Y):
                BM[X][Y](y) = 0
                FixY[Y] = y
                for x in val(X):
                    BM[X][Y](y) += B[X](x) * getfwdbelief(FM, par(X), FixY, [], X, x)
                endfor
            endfor
        endfor
    endfor

# Numiter = 1 is sufficient for trees, and Numiter = NumNodes is sufficient
# for any directed acyclic net
Iterate:
    Initialize
    for i = 1:Numiter
        ForwardPass
        BackwardPass
    endfor

ComputePosteriors:
    for X in V:
        if X in E: continue
        sumP = 0
        for x in val(X):
            posterior[X](x) = F[X](x) * B[X](x)
            sumP += posterior[X](x)
        endfor
        for x in val(X):
            posterior[X](x) /= sumP
        endfor
    endfor
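The recursive helper getfwdbelief is the only subtle part of the pseudo-code above. The following Python sketch shows one way it could be realized with concrete data structures; the function name fwd_belief, the dictionaries CPT, FM and parents, and the example numbers are all assumptions made for illustration, not part of any library.

from itertools import product

# Assumed layout: parents[X] lists X's parents in a fixed order, CPT[X] maps a
# tuple of parent values (in that order) to a dict over values of X, and
# FM[Y][X] maps each value of Y to the forward message from Y into X.
def fwd_belief(FM, CPT, parents, X, x, fixed=None):
    # Sum, over all combinations of parent values, of
    #   P(X = x | parent values) * product of forward messages,
    # holding any parent listed in `fixed` at its given value and omitting
    # that parent's forward message (needed for backward messages).
    fixed = fixed or {}
    ranges = [[fixed[Y]] if Y in fixed else list(FM[Y][X].keys())
              for Y in parents[X]]
    total = 0.0
    for combo in product(*ranges):
        term = CPT[X][combo][x]
        for Y, v in zip(parents[X], combo):
            if Y not in fixed:
                term *= FM[Y][X][v]
        total += term
    return total

# Tiny usage example: a single parent Y of node X, with made-up numbers.
parents = {"X": ["Y"]}
CPT = {"X": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.4, 1: 0.6}}}
FM = {"Y": {"X": {0: 0.3, 1: 0.7}}}
print(fwd_belief(FM, CPT, parents, "X", 0))   # 0.3*0.8 + 0.7*0.4 = 0.52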