Next: Restricted Acyclic Partially Directed Up: Searching for Bayesian Network Previous: Introduction

DAGs and Equivalence Classes of DAGs

The search procedures used within Bayesian network learning algorithms usually operate on the space of DAGs. In this context, the problem can be formally expressed as: Given a complete training set (i.e. we do not consider missing values or latent variables) $D=\{{\bf u}^1,\ldots , {\bf u}^m\}$ of instances of a finite set of variables, ${\mathcal{U}}$ , find the DAG such that

$\begin{displaymath} H^*=\arg\max_{H \in {\mathcal H}_n} g(H:D) \end{displaymath}$

(1)

where

is a scoring function measuring the fitness of any candidate DAG

to the dataset

, and ${\mathcal H}_n$ is the family of all the DAGs with

nodes².

Many of the search procedures, including the commonly used local search methods, rely on a neighborhood structure that defines the local rules (operators) used to move within the search space. The standard neighborhood in the space of DAGs uses the operators of arc addition, arc deletion and arc reversal, thereby avoiding (in the first and the third case) the inclusion of directed cycles in the graph.

The algorithms that search in the space of DAGs using local methods are efficient mainly because of the decomposability property that many scoring functions exhibit. A scoring function is said to be decomposable if the score of any Bayesian network structure may be expressed as a product (or a sum in the log-space) of local scores involving only one node and its parents:

$\displaystyle g(H:D) = \sum_{y\in {\mathcal{U}}} g_D(y,Pa_H(y))$			(2)

$\displaystyle g_D(y,Pa_H(y)) = g(y,Pa_H(y):N_{y,Pa_H(y)})$			(3)

where $N_{y,Pa_H(y)}$ are the statistics of the variables

and

, i.e. the number of instances in

that match each possible instantiation of

and

will denote the parent set of

in the DAG

, i.e. $Pa_H(y)=\{t\in {\cal U} \vert t \rightarrow y \in H\}$ .

A procedure that changes one arc at each move can efficiently evaluate the improvement obtained by this change. Such a procedure can reuse the computations carried out at previous stages, and only the statistics corresponding to the variables whose parent sets have been modified need to be recomputed. The addition or deletion of an arc $x\rightarrow y$ in a DAG can therefore be evaluated by computing only one new local score, $g_D(y,Pa_{H\cup\{x\}}(y))$ or $g_D(y,Pa_{H\setminus\{x\}}(y))$ , respectively. The evaluation of the reversal of an arc $x\rightarrow y$ requires the computation of two new local scores, $g_D(y,Pa_{H\setminus\{x\}}(y))$ and $g_D(x,Pa_{H\cup\{y\}}(x))$ .

It should be noted that each structure in the DAG space is not always different from the others in terms of its representation capability: if we interpret the arcs in a DAG as causal interactions between variables, then each DAG represents a different model; however, if we see a DAG as a set of dependence/independence relationships between variables (that permits us to factorize a joint probability distribution), then different DAGs may represent the same model. Even in the case of using a causal interpretation, if we use observation-only data (as opposed to experimental data where some variables may be manipulated), it is quite common for two Bayesian networks to be empirically indistinguishable. When two DAGs and can represent the same set of conditional independence assertions, we say that these DAGs are equivalent³ [59], and we denote this as $H\simeq H'$ .

When learning Bayesian networks from data using scoring functions, two different (but equivalent) DAGs may be indistinguishable, due to the existence of invariant properties on equivalent DAGs, yielding equal scores. We could take advantage of this in order to get a more reduced space of structures to be explored.

The following theorem provides a graphical criterion for determining the equivalence of two DAGs:

Theorem 1 [59] Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.

The skeleton of a DAG is the undirected graph that results from ignoring the directionality of every edge. A v-structure in a DAG

is an ordered triplet of nodes,

, such that (1)

contains the arcs $x\rightarrow y$ and $y \leftarrow z$ , and (2) the nodes

and

are not adjacent in

. A head-to-head pattern (shortened h-h) in a DAG

is an ordered triplet of nodes,

, such that

contains the arcs $x\rightarrow y$ and $y \leftarrow z$ . Note that in an h-h pattern

the nodes

and

can be adjacent.

Another characterization of equivalent DAGs was presented by [15], together with proof that several scoring functions used for learning Bayesian networks from data give the same score to equivalent structures (such functions are called score equivalent functions).

The concept of equivalence of DAGs partitions the space of DAGs into a set of equivalence classes. Whenever a score equivalent function is used, it seems natural to search for the best configuration in this new space of equivalence classes of DAGs. This change in the search space may bring several advantages:

The space of equivalent classes is more reduced than the space of DAGs (although it is still enormous). We could therefore expect to obtain better results (with the same search effort).
As we do not spend time generating (using the operators defined to move between neighboring configurations in the search space) and evaluating (using the scoring function) equivalent DAGs, we could obtain more efficient algorithms. However, as the ratio of the number of equivalence classes to the number of DAGs seems (empirically) to asymptote to 0.267 [15], the efficiency improvement may be small.
The search in the space of DAGs may be easily trapped in a local optimum, and the situation worsens as the operators defined for this space can move between configurations corresponding to equivalent DAGs (which will be evaluated with the same score). This difficulty can be partially avoided if we search in the space of equivalence classes.

The disadvantages are that, in this space of equivalence classes, it is more expensive to generate neighboring configurations, because we may be forced to perform some kind of consistency check, in order to ensure that these configurations represent equivalence classes⁴; in addition, the evaluation of the neighboring configurations may also be more expensive if we are not able to take advantage of the decomposability property of the scoring function. Finally, the new search space might introduce new local maxima that are not present in DAG space.

In order to design an exploring process for the space of equivalence classes we could use two distinct approaches: the first consists in considering that an equivalence class is represented by any of its components (in this case, it is necessary to avoid evaluating more than one component per class); and the second consists in using a canonical representation scheme for the classes.

The graphical objects commonly used to represent equivalence classes of DAGs are acyclic partially directed graphs [59] (known as PDAGs). These graphs contain both directed (arcs) and undirected (links) edges, but no directed cycles. Given a PDAG defined on a finite set of nodes ${\cal U}$ and a node $y\in {\cal U}$ , the following subsets of nodes are defined:

$Pa_G(y)=\{t\in {\cal U} \vert t \rightarrow y \in G\}$ , the set of parents of .
$Ch_G(y)=\{t\in {\cal U} \vert y \rightarrow t \in G\}$ , the set of children of .
$Ne_G(y)=\{t\in {\cal U} \vert y\mbox{-} t \in G\}$ , the set of neighbors of .
$Ad_G(y)=\{t\in {\cal U} \vert t \rightarrow y \in G, \mbox{ or } y\rightarrow t \in G \mbox{ or } y\mbox{-} t \in G\}$ , the set of adjacents to . Obviously $Ad_G(y)=Pa_G(y)\cup Ch_G(y)\cup Ne_G(y)$ .

An arc $x\rightarrow y$ in a DAG

is compelled if it appears in every DAG belonging to the same equivalence class as

. An arc $x\rightarrow y$ in

is said to be reversible if it is not compelled, i.e. there is a DAG

equivalent to

that contains the arc $x\leftarrow y$ . As every DAG in a particular equivalence class has the same set of compelled and reversible arcs, a canonical representation of an equivalence class is the PDAG consisting of an arc for every compelled arc in the equivalence class, and a link for every reversible arc. This kind of representation has been given several names: pattern [67], completed PDAG (CPDAG) [16], essential graph [4,21]. As a consequence of theorem 1, a completed PDAG possesses an arc $x\rightarrow y$ if and only if a triplet of nodes

forms a v-structure or the arc $x\rightarrow y$ is required to be directed due to other v-structures (to avoid forming a new v-structure or creating a directed cycle) (see Figure 1).

**Figure 1:** (a) Dag and (b) completed PDAG. The arcs $z\rightarrow x$ , $y\rightarrow x$ and $x\rightarrow u$ are compelled; the arc $y\rightarrow w$ is reversible
$\begin{figure}\centerline{\psfig{figure=./figuras/midagypdag.eps,height=.3\textwidth}}\end{figure}$

Note that an arbitrary PDAG does not necessarily represent some equivalence class of DAGs, although there is a one-to-one correspondence between completed PDAGs and equivalence classes of DAGs. Nevertheless, completed PDAGs are considerably more complicated than general PDAGs. A characterization of the specific properties that a PDAG must verify in order to be a completed PDAG was obtained by [4]:

Theorem 2 [4] A PDAG

is a completed PDAG if and only if it satisfies the following conditions:

1: is a chain graph, i.e. it contains no (partially) directed cycles.
2: The subgraph induced by every chain component⁵ of is chordal (i.e. on every undirected cycle of length greater than or equal to 4 there are two non-consecutive nodes connected by a link).
3: The configuration $x\rightarrow y\mbox{-}z$ does not occur as an induced subgraph of .
4: Every arc $x\rightarrow y\in G$ occurs in at least one of the four configurations displayed in Figure 2 as an induced subgraph of .

**Figure 2:** The four different configurations containing an arc $x\rightarrow y$ in a completed PDAG
$\begin{figure}\centerline{\psfig{figure=./figuras/4conf.eps,width=.67\textheight}}%%height=.3\textwidth}} \end{figure}$

Let us illustrate the advantages of searching in the space of equivalence classes of DAGs rather than the space of DAGs with a simple example. Figure 3 displays the set of possible DAGs involving three nodes $\{x,y,z\}$ , with arcs between and , and between and . The first three DAGs are equivalent. In terms of independence information, they lead to the same independence statement $I(y,z\vert x)$ ( and are conditionally independent given ), whereas the statement $I(y,z\vert\emptyset)$ ( and are marginally independent) corresponds to the fourth one. The four DAGs may be summarized by only two different completed PDAGs, shown in Figure 4.

**Figure 3:** Four different DAGs with three nodes and two arcs
$\begin{figure}\centerline{\psfig{figure=./figuras/mipattern2.eps,height=.3\textwidth}}\end{figure}$

**Figure 4:** Two different equivalence classes of DAGs
$\begin{figure}\centerline{\psfig{figure=./figuras/miequival2.eps,height=.15\textwidth}}\end{figure}$

As we can see, the search space may be reduced by using PDAGs to represent the classes: in our example, to two classes instead of four configurations; it can be seen [4] that the ratio of the number of DAGs to the number of classes is for three nodes, for four nodes and for five nodes; in more general terms, the results obtained by [39] indicate that this ratio approaches a value of less than four as the number of nodes increases. The use of equivalence classes therefore entails convenient savings in exploration and evaluation effort, although the gain is not spectacular.

On the other hand, the use of a canonical representation scheme allows us to explore the space progressively and systematically, without losing any unexplored configuration unnecessarily. Returning to our example, let us suppose that the true model is the DAG displayed in Figure 4.b and we start the search with an empty graph (with no arcs). Let us also assume that the search algorithm identifies that an edge between and produces the greatest improvement in the score. At this moment, the two alternatives, $x\rightarrow y$ and $x\leftarrow y$ (case 1 and case 2 in Figure 5, respectively), are equivalent. Let us now suppose that we decide to connect the nodes and ; again we have two options: $x\rightarrow z$ or $x\leftarrow z$ . Nevertheless, depending on the previous selected configuration, we obtain different outcomes that are no longer equivalent (see Figure 5).

If we had chosen case 1 (thus obtaining either case 1.1 or case 1.2), we would have eliminated the possibility of exploring the DAG $z\rightarrow x\leftarrow y$ , and therefore the exploring process would have been trimmed. As the true model is precisely this DAG (case 2.1 in Figure 5), then the search process would have to include another arc connecting and (cases 1.1.1, 1.2.1 or 1.2.2), because and are conditionally dependent given . At this moment, any local search process would stop (in a local optimum), because every local change (arc reversal or arc removal) would make the score worse.

**Figure 5:** Local search in the space of DAGs is trapped at a local optimum
$\begin{figure}\par\centerline{\psfig{figure=./figuras/trapped.eps,height=.32\textwidth}}\end{figure}$

Consequently, our purpose consists in adding or removing edges (either links or arcs) to the structure without pruning the search space unnecessarily. We could therefore introduce links instead of arcs (when there is not enough information to distinguish between different patterns of arcs), which would serve as templates or dynamic linkers to equivalence patterns. They represent any valid combination of arcs which results in a DAG belonging to the same equivalence class.

Looking again at the previous example, we would proceed as follows: assuming that in our search space the operators of link addition and creation of h-h patterns are available, we would first include the link $x\mbox{-}y$ ; secondly, when considering the inclusion of a connection between and , we would have two options, shown in Figure 4: the h-h pattern $z\rightarrow x\leftarrow y$ and the pattern $z \mbox{-}x \mbox{-}y$ . In this case the scoring function would assign the greatest score to the h-h pattern $z\rightarrow x\leftarrow y$ , thus obtaining the correct DAG.

Next: Restricted Acyclic Partially Directed Up: Searching for Bayesian Network Previous: Introduction

Luis Miguel de Campos Ibáñez 2003-05-30