Key concepts:
The best text to learn about Bayes Nets is probably Judea Pearl's original paper on the topic.
Bayes networks, also called Belief nets, are a graphical way of representing local dependencies among a set of variables.
It is best illustrated by an example.
Caption: Bayes network showing dependencies between variables related to diagnosis of problems in a car.
More later.
Given a network with a set of nodes $(X_1, X_2, \cdots, X_K)$, the joint probability of all the variables in the network is given by \[ P(X_1, X_2, \cdots, X_K) = \prod_i P(X_i | parents(X_i)) \]
The marginal probability of any subset $(X_1, X_2, \cdots, X_N)$ of variables can now be computed by marginalizing out the remaining variables: \[ P(X_1, \cdots, X_N) = \sum_{X_{N+1}, \cdots, X_K} \prod_{i=1}^K P(X_i | parents(X_i)) \]
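As a concrete sketch of these two formulas, consider a toy three-node network $A \rightarrow C \leftarrow B$ with invented CPTs (a hypothetical example, not the car network above; the names joint and P_C1_given_AB and all the numbers are made up for illustration). The joint is the product of each node's conditional probability given its parents, and a marginal is obtained by summing the joint over the remaining variables.

import itertools

# Toy network A -> C <- B with made-up CPTs (illustration only).
P_A = {0: 0.3, 1: 0.7}
P_B = {0: 0.6, 1: 0.4}
P_C1_given_AB = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A=a, B=b)

def joint(a, b, c):
    # P(A=a, B=b, C=c) = P(A=a) * P(B=b) * P(C=c | A=a, B=b)
    pc1 = P_C1_given_AB[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1.0 - pc1)

# Marginal P(C=1): sum the joint over the remaining variables A and B.
p_c1 = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
print(p_c1)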
Bayes nets may be used to compute the a posteriori probability distributions of variables (represented by nodes), given evidence.
“Evidence”, in this context, generally refers to the values of other variables that may be observed.
In the car example above for instance, we may be given that the lights function, the fuel gauge shows a full tank, and the battery was recently changed, but the car does not start. These are four evidentiary values, specifying the values of the variables $L$ ($L=1$), $G$ ($G=1$), $A$ ($A = 0$), and $E$ ($E=0$).
We must determine the probability that the spark plug is OK, i.e. that $K=1$, given all this evidence. That is, we would like to compute the probability $P_K(1 | L=1, G=1, A=0, E=0)$, or more generally, the distribution $P_K(k | L=1, G=1, A=0, E=0)$ over all possible values $k$ of $K$.
We can do so as follows: \[ P_K(k | L=1, G=1, A=0, E=0) = \frac{P_{K,L,G,A,E}(k, 1, 1,0,0)}{P_{L,G,A,E}( 1, 1,0,0)} \]
The above term cannot, however, be directly computed, since what we actually have from the Bayes network is the joint probability of all variables $(A,B,L,S,T,K,E,M,N,F,Y,G)$. In order to obtain the marginal probabilities $P_{K,L,G,A,E}(k, 1, 1,0,0)$ and $P_{L,G,A,E}(1, 1,0,0)$, we must marginalize out the other variables. To do so, we must compute \begin{align} P_K(k | L=1, G=1, A=0, E=0) &= \frac{\sum_b\sum_s\sum_t\sum_m\sum_n\sum_f\sum_y P_{A,B,L,S,T,K,E,M,N,F,Y,G}(0,b,1,s,t,k,0,m,n,f,y,1)}{\sum_b\sum_s\sum_t\sum_m\sum_{k'}\sum_n\sum_f\sum_y P_{A,B,L,S,T,K,E,M,N,F,Y,G}(0,b,1,s,t,k',0,m,n,f,y,1)} \\ &= \frac{\sum_b\sum_s\sum_t\sum_m\sum_n\sum_f\sum_y P_A(0)P_L(1)P_{B|A,L}(b|0,1)P_S(s)P_{T|B,S}(t|b,s)P_M(m)P_N(n)P_F(f)P_{Y|M,N,F}(y|m,n,f)P_{G|F}(1|f)P_K(k)P_{E|K,Y}(0|k,y)}{\sum_b\sum_s\sum_t\sum_{k'}\sum_m\sum_n\sum_f\sum_y P_A(0)P_L(1)P_{B|A,L}(b|0,1)P_S(s)P_{T|B,S}(t|b,s)P_M(m)P_N(n)P_F(f)P_{Y|M,N,F}(y|m,n,f)P_{G|F}(1|f)P_K(k')P_{E|K,Y}(0|k',y)} \end{align}
More generally, given a Bayes net with a set of nodes ${\mathcal Y}$, and given the values at a set $E$ evidence nodes, the a posteriori probability distribution of a variable $X$ can be computed as \[ P_X(x|E) = \frac{\sum_{Y \in {\mathcal Y}\setminus (E \bigcup X)}P_X(x|parents(X)) \prod_{Z \in E} P(Z | parents(Z))\prod_{Y \in {\mathcal Y}\setminus (E \bigcup X)}P(Y | parents(Y))}{\sum_{Y \in {\mathcal Y}\setminus E }\prod_{Z \in E} P(Z | parents(Z)) \prod_{Y\in{\mathcal Y}\setminus E}P(Y | parents(Y))} \]
The ugly-looking equation above is not as ugly as it looks. $P(Z|parents(Z))$ refers to the probability of $Z$ given the values of the variables at the immediate parents of $Z$ in the network. ${\mathcal Y}\setminus E$ refers to all nodes in the network excluding the evidence nodes for which values have been specified. ${\mathcal Y}\setminus (E \bigcup X)$ refers to all nodes in the network excluding both the evidence nodes and the node $X$ for which the a posteriori probability distribution is being computed. The summation in the numerator is simply marginalizing out all nodes other than $X$ for which we have no evidence. The summation in the denominator is marginalizing out all nodes (including $X$) for which we have no evidence.
The important thing to note is that computing the a posteriori probability distribution of $X$ as given above requires a series of nested summations over all marginalized non-evidence nodes. For even medium-sized networks, this can become a tremendously expensive operation (and is exponential in the size of the network).
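To make the cost concrete, here is a sketch of this computation by direct enumeration. The data structures (parents, CPT), the toy chain $A \rightarrow B \rightarrow C$, the binary values, and all the numbers are assumptions made for illustration; the nested loop over every assignment of the non-evidence variables is exactly the set of nested summations described above.

from itertools import product

def posterior_by_enumeration(nodes, parents, CPT, query, evidence):
    # P(query = q | evidence), obtained by summing the full joint over every
    # assignment consistent with the evidence; this is exponential in the
    # number of non-evidence nodes.
    hidden = [X for X in nodes if X not in evidence and X != query]
    scores = {}
    for q in (0, 1):
        total = 0.0
        for combo in product((0, 1), repeat=len(hidden)):
            assign = dict(zip(hidden, combo))
            assign.update(evidence)
            assign[query] = q
            p = 1.0
            for X in nodes:
                pa = tuple(assign[Y] for Y in parents[X])
                p *= CPT[X][pa][assign[X]]
            total += p
        scores[q] = total
    norm = sum(scores.values())
    return {q: s / norm for q, s in scores.items()}

# Toy chain A -> B -> C with invented CPTs, evidence C = 1, query B.
parents = {"A": [], "B": ["A"], "C": ["B"]}
CPT = {
    "A": {(): {0: 0.4, 1: 0.6}},
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "C": {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
}
print(posterior_by_enumeration(["A", "B", "C"], parents, CPT, "B", {"C": 1}))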
We can sometimes compute the marginal distribution of specific variables or groups of variables much more efficiently by rearranging the summations. This technique is called variable elimination.
The general idea is to arrange the nodes in topological sort order and push each summation as far inward as possible, so that the innermost summations are evaluated first.
Let $X_1, \cdots, X_N$ be the nodes in topologically sorted order. We would like to compute the marginal probability of $X_K$ taking a value $x_K$.
We can write \begin{align} P_{X_K}(x_K | E) &= \sum_{x_1} \cdots \sum_{x_{K-1}} \sum_{x_{K+1}}\cdots \sum_{x_N} P_{X_1}(x_1 |E) \cdots P_{X_{K-1}}(x_{K-1} | parents(X_{K-1}),E)\,P_{X_{K}}(x_{K} | parents(X_{K}),E)\,P_{X_{K+1}}(x_{K+1} | parents(X_{K+1}),E)\cdots P_{X_N}(x_N | parents(X_N),E) \end{align} \begin{align} &= \sum_{x_1} P_{X_1}(x_1|E) \sum_{x_2} P_{X_2}(x_2 | parents(X_2),E) \cdots \sum_{x_{K-1}}P_{X_{K-1}}(x_{K-1} | parents(X_{K-1}),E)\,P_{X_{K}}(x_{K} | parents(X_{K}),E)\sum_{x_{K+1}}P_{X_{K+1}}(x_{K+1} | parents(X_{K+1}),E) \cdots \sum_{x_N} P_{X_N}(x_N | parents(X_N),E) \end{align}
While not immediately obvious from the above equation, the second version greatly reduces the number of summations. For instance, $\sum_{x_N}P_{X_N}(x_N | parents(X_N),E)$ only requires a summation over all values of $X_N$ for each combination of values of the parents of $X_N$.
Consider the fuel gauge $G$ in our car network. Assume that in the topological sort, $G$ ends up at the end. In the above form of the solution, we only need to compute $\sum_g P_G(g | F = 0, E)$ and $\sum_g P_G(g | F=1, E)$, for a total of four summations. These values are computed once, and there is no further summation over $G$. Even more importantly, since $G$ depends only on $F$, if $G$ is not itself an evidence node (i.e. its value has not been specified), then each of the above summations is simply 1. The variable $G$ is eliminated from consideration.
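The following sketch illustrates the rearrangement on a hypothetical four-node chain $X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow X_4$ with invented binary CPTs (none of this is the car network). The naive marginal of $X_2$ enumerates all combinations of $X_1$, $X_3$ and $X_4$; in the rearranged version the innermost sums over $X_4$ and $X_3$ each collapse to 1 because those variables are unobserved, so they are eliminated.

# Chain X1 -> X2 -> X3 -> X4 with invented CPTs (illustration only).
P1 = [0.4, 0.6]                       # P1[x1]     = P(X1=x1)
P2 = [[0.7, 0.3], [0.2, 0.8]]         # P2[x1][x2] = P(X2=x2 | X1=x1)
P3 = [[0.5, 0.5], [0.1, 0.9]]         # P3[x2][x3] = P(X3=x3 | X2=x2)
P4 = [[0.6, 0.4], [0.3, 0.7]]         # P4[x3][x4] = P(X4=x4 | X3=x3)

# Naive marginal of X2: enumerate every combination of X1, X3 and X4.
naive = [sum(P1[x1] * P2[x1][x2] * P3[x2][x3] * P4[x3][x4]
             for x1 in (0, 1) for x3 in (0, 1) for x4 in (0, 1))
         for x2 in (0, 1)]

# Rearranged: sum out X4 first, then X3.  Each inner sum is 1 because the
# variable is unobserved, so X4 and X3 are eliminated; only X1 remains.
s4 = [sum(P4[x3][x4] for x4 in (0, 1)) for x3 in (0, 1)]            # [1.0, 1.0]
s3 = [sum(P3[x2][x3] * s4[x3] for x3 in (0, 1)) for x2 in (0, 1)]   # [1.0, 1.0]
rearranged = [s3[x2] * sum(P1[x1] * P2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]

print(naive)
print(rearranged)   # same values as `naive`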
Variable elimination is an effective mechanism for computing the marginal probability distribution of a single node. But if we want to compute the marginals for all the nodes in the network at once efficiently, we require belief propagation.
A much faster way to compute a posteriori probabilities of variables is to consider the fact that each variable in the net actually only connects to a small number of other variables, namely its immediate parents, and its immediate children.
Consider any network that is a directed acyclic graph (DAG). We wish to compute the a posteriori probability of a variable $X$, given some evidence $E$. Consider the following simple subgraph (where we have only shown $X$, its parents and its children).
Caption: A node $X$ in a network, its parents, and children.
We aim to compute the a posteriori probability $P_X(x|E)$.
To do so, we can compute the joint probability distribution of $X$, its parents, and its children, and then marginalize the parents and children out. \[ P_X(x|E) = \sum_{y_1,\cdots y_J}\sum_{z_1,\cdots,z_K} P_{Y_1,\cdots,Y_J,X,Z_1,\cdots,Z_K}(y_1, \cdots, y_J, x, z_1, \cdots, z_K | E) \]
where the notation $\sum_{y_1,\cdots y_J}$ represents a summation over all combinations of values taken by $Y_1, \cdots, Y_J$, and $\sum_{z_1,\cdots z_K}$ represents a summation over all combinations of values taken by $Z_1, \cdots, Z_K$.
The reason for rewriting it in the above manner may not be immediately apparent, but as we see below, this enables us to eliminate the direct dependence of $X$ on the evidence $E$.
Using the dependencies depicted in the network, we can write \begin{align} P_{Y_1,\cdots,Y_J,X,Z_1,\cdots,Z_K}(y_1, \cdots, y_J, x, z_1, \cdots, z_K | E) &= P_{Y_1}(y_1|E)\cdots P_{Y_J}(y_J|E) P_{X|Y_1,\cdots,Y_J}(x | y_1,\cdots,y_J) P_{Z_1|X}(z_1|x,E) \cdots P_{Z_K|X}(z_K|x,E) \\ &= \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X|Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \end{align}
Combining it with the marginalization given above, we get \begin{align} P_X(x|E) &= \sum_{y_1,\cdots y_J} \sum_{z_1,\cdots,z_K} \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \\ &= \sum_{y_1,\cdots y_J} \prod_{j=1}^J P_{Y_j}(y_j | E) P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \sum_{z_1,\cdots,z_K} \prod_{k=1}^K P_{Z_k|X}(z_k | x, E) \\ &= \sum_{y_1,\cdots y_J} P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{j=1}^J P_{Y_j}(y_j | E) \prod_{k=1}^K\sum_{z_k}P_{Z_k|X}(z_k | x, E) \end{align}
Grouping the terms, we can hence write \[ P_X(x|E) = \bigg(\sum_{y_1,\cdots y_J} P_{X | Y_1, \cdots, Y_J}(x | y_1, \cdots, y_J) \prod_{j=1}^J P_{Y_j}(y_j | E) \bigg)\bigg(\prod_{k=1}^K\sum_{z_k}P_{Z_k|X}(z_k | x, E)\bigg) \]
or more generally, with some minor abuse of notation \begin{equation}\label{1}\tag{1} P_X(x | E) = \Bigg(\sum_{V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}} P_{X|{\mathcal P}(X)}(x|V_Y) \prod_{Y \in {\mathcal P}(X)} P_Y(v_Y|E) \Bigg) \Bigg(\sum_{V_Z \in \{(v_Z: Z \in {\mathcal C}(X))\}} \prod_{Z \in {\mathcal C}(X)} P_{Z|X}(v_Z|x,E) \Bigg) \end{equation}
Here ${\mathcal P}(X)$ represents all parents of $X$, $v_Y$ represents a value taken by a parent $Y$, $(v_Y: Y \in {\mathcal P}(X))$ represents a combination of values taken by all parents of $X$, and $V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}$ represents the complete set of all such combinations.
Similarly, ${\mathcal C}(X)$ represents all children of $X$, $v_Z$ represents a value taken by a child $Z$, $(v_Z: Z \in {\mathcal C}(X))$ represents a combination of values taken by all children of $X$, and $V_Z \in \{(v_Z: Z \in {\mathcal C}(X))\}$ represents the complete set of all such combinations.
The a posteriori probability of any value of $X$ can thus be factored into two terms, shown in parentheses in the above equation. The first represents belief in the value obtained from the parents of $X$. The second is belief derived from the children.
Thus the a posteriori probability of $X$ is a product of beliefs inherited from its parents and its children. To complete the solution, however, we need the beliefs of the parents and children, which in turn can be similarly computed. This leads to an iterative procedure called Belief Propagation.
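Before describing the procedure, here is a small numeric check of the factorization above, on a hypothetical chain $Y \rightarrow X \rightarrow Z$ with invented CPTs and evidence $Z=1$ (the names PY, PX_given_Y, PZ_given_X and all numbers are made up for illustration). The parent term and the child term are computed separately, their product is normalized, and the result matches direct enumeration.

# Chain Y -> X -> Z, binary variables, evidence Z = 1 (illustration only).
PY = [0.3, 0.7]                           # PY[y]          = P(Y=y)
PX_given_Y = [[0.8, 0.2], [0.4, 0.6]]     # PX_given_Y[y][x] = P(X=x | Y=y)
PZ_given_X = [[0.9, 0.1], [0.25, 0.75]]   # PZ_given_X[x][z] = P(Z=z | X=x)

# Parent term: sum over y of P(X=x | Y=y) P(Y=y).  Child term: P(Z=1 | X=x).
parent_term = [sum(PY[y] * PX_given_Y[y][x] for y in (0, 1)) for x in (0, 1)]
child_term = [PZ_given_X[x][1] for x in (0, 1)]

unnorm = [parent_term[x] * child_term[x] for x in (0, 1)]
posterior = [u / sum(unnorm) for u in unnorm]

# Brute-force check by enumerating Y directly.
brute = [sum(PY[y] * PX_given_Y[y][x] * PZ_given_X[x][1] for y in (0, 1))
         for x in (0, 1)]
norm = sum(brute)
brute = [b / norm for b in brute]

print(posterior)
print(brute)    # identical to `posterior`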
The figure below illustrates the BP procedure. Each node $X$ in the graph has a “forward” belief $F_X(x)$, which indicates the belief, obtained from its parents, that $X$ will take the value $x$. Each node also has a “backward” belief $B_X(x)$, which is the belief that $X$ will take value $x$ based on its effect on its children.
Caption: The figure shows the forward belief messages passed to a node $X$ by its parents, and the backward belief messages obtained from its children. The figure also shows all the nodes that contribute to these messages. Downward arrows represent forward messages. Upward arrows represent backward messages. Blue arrows represent all messages that finally contribute to the forward beliefs at $X$. Red arrows represent all messages that contribute to the backward beliefs at $X$. The forward beliefs $F_{Y_1}(y_1)$ and $F_{Y_2}(y_2)$ at parent nodes $Y_1$ and $Y_2$ also contribute forward messages $F_{Y_i,X}(y_i)$ into $X$. These forward messages are used to compose the forward belief $F_X(x)$. Similarly, the backward beliefs $B_{Z_1}(z_1)$ and $B_{Z_2}(z_2)$ at child nodes $Z_1$ and $Z_2$ also contribute to the backward messages $B_{Z_i,X}(x)$ into $X$, which are used to compose backward belief $B_X(x)$.
The forward belief $F_X(x)$ is derived from forward belief messages $F_{Y,X}(y)$ from each parent $Y$ of $X$, which are in turn functions of the forward beliefs of the parents.
The backward belief $B_X(x)$ is derived from backward belief messages $B_{Z,X}(x)$ from the children of $X$. The messages are in turn functions of the backward beliefs of the children.
The complete set of equations is as follows.
Node Beliefs:
\[ \label{fwdbelief}\tag{2} F_X(x) = \sum_{V_Y \in \{(v_Y: Y \in {\mathcal P}(X))\}} P_{X|{\mathcal P}(X)}(x|V_Y) \prod_{Y \in {\mathcal P}(X)} F_{Y,X}(v_Y) \]
\[ \label{bwdbelief}\tag{3} B_X(x) = \prod_{Z \in {\mathcal C}(X)} B_{Z,X}(x) \]
Here, as before, ${\mathcal P}(X)$ refers to the parents of $X$, and ${\mathcal C}(X)$ refers to its children. $v_Y$ refers to a value of $Y$, and $\{(v_Y: Y \in {\mathcal P}(X))\}$ refers to the set of all combinations of values of all of the parents of $X$.
Note that the forward and backward belief equations are exactly analogous to the first and second parenthesized terms in Equation $\eqref{1}$ above.
Note that the forward messages $F_{Y,X}(v_Y)$ from the parents refer to the values $v_Y$ taken by the parent nodes, but not to the value $x$ taken by $X$. On the other hand, the backward messages $B_{Z,X}$ from the children depend only on the value of $X$, but do not refer to the actual values taken by its children. This is also analogous to the terms in Equation $\eqref{1}$, and reflects the fact that the direction of information flow follows the direction of the arrows in the graph.
Special Cases:
If $X$ is a source node (it has no parents), its forward belief is simply its prior probability: $F_X(x) = P_X(x)$.
If $X$ is a sink node (it has no children), its backward belief is uninformative: $B_X(x) = 1$ for all $x$.
If $X$ is an evidence node with observed value $\hat{x}$, its beliefs are clamped: $F_X(x) = B_X(x) = 1$ if $x = \hat{x}$, and $0$ otherwise.
Messages:
\[ \label{fwdmsg}\tag{4} F_{Y,X}(v_Y) = F_Y(v_Y) \prod_{Z \in ({\mathcal C}(Y) \setminus X)} B_{Z,Y}(v_Y) \]
\[ \label{bwdmsg}\tag{5} B_{Z,X}(x) = \sum_{z} B_Z(z) \sum_{V_Y \in \{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}} P_{Z|X,({\mathcal P}(Z)\setminus X)}(z|x,V_Y) \prod_{Y \in ({\mathcal P}(Z)\setminus X)} F_{Y,Z}(v_Y) \]
Here $({\mathcal C}(Y) \setminus X)$ refers to the set of all children of $Y$ excluding $X$. Similarly, $({\mathcal P}(Z)\setminus X)$ is the set of all parents of $Z$ excluding $X$.
In Equation $\eqref{bwdmsg}$ $\{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}$ refers to the set of all combinations of values of all parents of $Z$ excluding $X$. $\sum_{V_Y \in \{(v_Y: Y \in ({\mathcal P}(Z)\setminus X))\}}$ refers to a summation over all elements of this set. The notation $P_{Z|X,({\mathcal P}(Z)\setminus X)} (z|x,V_Y)$ refers to the conditional probability of $Z$ taking the value $z$, given that $X$ takes the value $x$ and the remaining parents of $Z$ take the values $V_Y$.
Note that the forward message in Equation $\eqref{fwdmsg}$ is simply the product of the forward belief at $Y$ and the backward messages into $Y$ from all of its children other than $X$ (i.e. the backward belief at $Y$ with $X$'s own contribution removed). Similarly, the backward message in Equation $\eqref{bwdmsg}$ is the product of the backward belief at $Z$ and the forward belief at $Z$ computed with $X$ fixed to $x$, summed over all values of $Z$. The import of this becomes obvious below.
The a posteriori probability distribution of $X$ now simply becomes \[\label{posterior}\tag{6} P_X(x|E) = \frac{F_X(x)B_X(x)}{\sum_{\hat{x}}F_X(\hat{x})B_X(\hat{x})} \]
Given: Network and a set of evidence nodes ${\mathcal E}$ whose values have been specified.
Initialization: Set all beliefs and messages to 1. Set the forward beliefs of source nodes to their prior probabilities. For each evidence node, set its forward and backward beliefs to 1 at the observed value and to 0 at all other values.
Arrange (Optional): Topologically sort the nodes (and obtain the reverse order), so that each pass sweeps parents before children or vice versa.
Iterate to convergence: Alternate forward passes over the nodes in topological order, which update forward beliefs and the forward messages to children, and backward passes in reverse topological order, which update backward beliefs and the backward messages to parents. One pass in each direction suffices for trees.
Compute Posterior Probability: For each non-evidence node, compute the posterior from its forward and backward beliefs using Equation $\eqref{posterior}$.
The following “pseudo-code” may clarify. Note the distinction between $[X]$ and $(x)$, both of which are used for indexing: $[X]$ is used to index w.r.t. nodes, and $(x)$ to index w.r.t. node values.
# Notation:
# NET = The complete network specification, including nodes and edges
#
# V = set of all nodes in net
# SOURCE = set of source nodes in V
# SINK = set of sink nodes in V
#
# Vsort = All nodes in topological sort order
# Vrevsort = All nodes in reverse topological sort order
#
# E = set of all nodes for which a value has been specified as evidence
# For each node X in E, v(X) = Specified evidence value for X
#
# X = individual node in net
# par(X) = set of parent nodes for node X
#          Assuming par(X) = empty set for source nodes
# child(X) = set of child nodes for node X
#            Assuming child(X) = empty set for sink nodes
#
# val(X) = set of values that X can take
# N(X) = Number of distinct values X can take (i.e. size(val(X)))
#
# P[X](x | val(par(X))) = probability table for all non-source nodes X
# P[X](x) = probability table for source nodes X
#
# F[X](x) = Forward belief for node X and value x
# B[X](x) = Backward belief for node X and value x
# FM[X][Z](x) = Forward message from node X at value x to child node Z
# BM[X][Y](y) = Backward message from node X to parent node Y at value y

Initialize:
    # Initially set everything to 1: All values have equal belief.
    # Note the difference in how forward and backward beliefs are initialized.
    for X in V:
        for v in val(X): F[X](v) = 1, B[X](v) = 1
        for Z in child(X), for v in val(X): FM[X][Z](v) = 1
        for Y in par(X), for v in val(Y): BM[X][Y](v) = 1
    endfor

    # Initialize forward belief for source nodes
    for X in SOURCE, for v in val(X): F[X](v) = P[X](v)

    # Initialize backward belief for sink nodes (redundant)
    for X in SINK, for v in val(X): B[X](v) = 1

    # Initialize beliefs for evidence nodes.
    # Do this last, so that evidence values override the source-node priors.
    # Remember that the default values were initialized to 1.
    # v(X) is the *observed* evidence value of X.
    for X in E:
        for x in val(X):
            if x == v(X):
                F[X](x) = B[X](x) = 1
            else:
                F[X](x) = B[X](x) = 0
            endif
            # If the forward pass is done first, this should not be necessary
            for Z in child(X):
                if x == v(X):
                    FM[X][Z](x) = 1
                else:
                    FM[X][Z](x) = 0
                endif
            endfor
        endfor
    endfor

Sort:
    # You may use any algorithm for topological sorting.
    # Kahn's algorithm is simple to implement.
    Vsort = TopologicalSort(V, NET)
    # The reverse topological sort could just be the reverse of Vsort
    Vrevsort = ReverseTopologicalSort(V, NET)

# An ugly little helper function to compute forward beliefs recursively.
# We need recursion because each node may have a different number of parents.
# If we wanted to be efficient, we would flatten this computation into
# nested loops at every node.
# "FixedNodes" is an array of nodes for which a value is externally fixed.
# We don't need it to compute forward beliefs, but need it for backward messages.
getfwdbelief(FM, Pset, FixedNodes, inPvalueSet, X, x):
    Y = pop Pset
    FB = 0
    for v in val(Y):
        if Y in FixedNodes and FixedNodes[Y] != v: continue
        # A fixed parent contributes no forward message: the backward message
        # to a parent must not include that parent's own forward message.
        if Y in FixedNodes:
            msg = 1
        else:
            msg = FM[Y][X](v)
        endif
        PvalueSet = [inPvalueSet v]
        if Pset is empty:
            FB += msg * P[X](x | PvalueSet)
        else:
            FB += msg * getfwdbelief(FM, Pset, FixedNodes, PvalueSet, X, x)
        endif
    endfor
    return FB

ForwardPass:
    for X in Vsort:
        # Forward beliefs of source nodes and evidence nodes are not modified
        if X not in SOURCE and X not in E:
            for x in val(X):
                F[X](x) = getfwdbelief(FM, par(X), [], [], X, x)
            endfor
        endif
        # Compute forward messages to children
        for Z in child(X):
            for x in val(X):
                FM[X][Z](x) = F[X](x) * (B[X](x) / BM[Z][X](x))
            endfor
        endfor
    endfor

BackwardPass:
    for X in Vrevsort:
        # Backward beliefs of sink nodes and evidence nodes are not modified
        if X not in SINK and X not in E:
            for x in val(X):
                B[X](x) = 1
                for C in child(X):
                    B[X](x) *= BM[C][X](x)
                endfor
            endfor
        endif
        # Backward messages to parents
        for Y in par(X):
            # Make sure to clear FixY first, so that it has only ONE entry
            FixY = []
            for y in val(Y):
                BM[X][Y](y) = 0
                FixY[Y] = y
                for x in val(X):
                    BM[X][Y](y) += B[X](x) * getfwdbelief(FM, par(X), FixY, [], X, x)
                endfor
            endfor
        endfor
    endfor

# Numiter = 1 is sufficient for trees, and Numiter = NumNodes is sufficient
# for any directed acyclic net
Iterate:
    Initialize
    for i = 1:Numiter
        ForwardPass
        BackwardPass
    endfor

ComputePosteriors:
    for X in V:
        if X in E: continue
        sumP = 0
        for x in val(X):
            posterior[X](x) = F[X](x) * B[X](x)
            sumP += posterior[X](x)
        endfor
        for x in val(X):
            posterior[X](x) /= sumP
        endfor
    endfor
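The recursive helper getfwdbelief is the only subtle part of the pseudo-code above. The following Python sketch shows one way it could be realized with concrete data structures; the function name fwd_belief, the dictionaries CPT, FM and parents, and the example numbers are all assumptions made for illustration, not part of any library.

from itertools import product

# Assumed layout: parents[X] lists X's parents in a fixed order, CPT[X] maps a
# tuple of parent values (in that order) to a dict over values of X, and
# FM[Y][X] maps each value of Y to the forward message from Y into X.
def fwd_belief(FM, CPT, parents, X, x, fixed=None):
    # Sum, over all combinations of parent values, of
    #   P(X = x | parent values) * product of forward messages,
    # holding any parent listed in `fixed` at its given value and omitting
    # that parent's forward message (needed for backward messages).
    fixed = fixed or {}
    ranges = [[fixed[Y]] if Y in fixed else list(FM[Y][X].keys())
              for Y in parents[X]]
    total = 0.0
    for combo in product(*ranges):
        term = CPT[X][combo][x]
        for Y, v in zip(parents[X], combo):
            if Y not in fixed:
                term *= FM[Y][X][v]
        total += term
    return total

# Tiny usage example: a single parent Y of node X, with made-up numbers.
parents = {"X": ["Y"]}
CPT = {"X": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.4, 1: 0.6}}}
FM = {"Y": {"X": {0: 0.3, 1: 0.7}}}
print(fwd_belief(FM, CPT, parents, "X", 0))   # 0.3*0.8 + 0.7*0.4 = 0.52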