next up previous
Next: Concluding Remarks Up: Searching for Bayesian Network Previous: Comparison with other Approaches


Experimental Results

In this section we shall describe the experiments carried out with our algorithm, the obtained results, and a comparative study with other algorithms for learning Bayesian networks. We have selected nine different problems to test our algorithm, all of which only contain discrete variables: Alarm (Figure 18), Insurance (Figure 19), Hailfinder (Figure 20), Breast-Cancer, crx, Flare2, House-Votes, Mushroom, and Nursery.

The Alarm network displays the relevant variables and relationships for the Alarm Monitoring System [5], a diagnostic application for patient monitoring. This network, which contains 37 variables and 46 arcs, has been considered as a benchmark for evaluating Bayesian network learning algorithms. The input data commonly used are subsets of the Alarm database built by [43], which contains 20000 cases that were stochastically generated using the Alarm network. In our experiments, we have used three databases of different sizes (the first $k$ cases in the Alarm database, for $k=3000, 5000$ and $10000$).

Figure 18: The Alarm network
\begin{figure}\centerline{\psfig{figure=./figuras/alarm.eps,height=.40\textwidth}}\end{figure}

Insurance [6] is a network for evaluating car insurance risks. The Insurance network contains 27 variables and 52 arcs. In our experiments, we have used five databases containing 10000 cases, generated from the Insurance Bayesian network.

Figure 19: The Insurance network
\begin{figure}\centerline{\psfig{figure=./figuras/insurance.epsi,height=.40\textheight,width=.80\textwidth}}\end{figure}

Hailfinder [1] is a normative system that forecasts severe summer hail in northeastern Colorado. The Hailfinder network contains 56 variables and 66 arcs. In this case, we have also used five databases with 10000 cases generated from the Hailfinder network.

Figure 20: The Hailfinder network
\begin{figure}\centerline{\psfig{figure=./figuras/hailfinder.eps,height=.87\textheight,width=.99\textwidth}}%%width}}
\end{figure}

Breast-Cancer, crx, Flare2, House-Votes, Mushroom, and Nursery are databases available from the UCI Machine Learning Repository. Breast-Cancer contains 10 variables (9 attributes, two of which have missing values, and a binary class variable) and 286 instances. The crx database concerns credit card applications. It has 490 cases and 16 variables (15 attributes and a class variable), and seven variables have missing values. Moreover, six of the variables in the crx database are continuous and were discretized using the MLC++ system [48]. Flare2 uses 13 variables (10 attributes and 3 class variables, one for the number of times a certain type of solar flare occured in a 24-hour period) and contains 1066 instances, without missing values. House-Votes stores the votes for each of the U.S. House of Representatives Congressmen on 16 key votes; it has 17 variables and 435 records and all the variables except two have missing values. Mushroom contains 8124 cases corresponding to species of gilled mushrooms in the Agaricus and Lepiota Family; there are 23 variables (a class variable, stating whether the mushroom is edible or poisonous, and 22 attribute variables) and only one variable has missing values. Nursery contains data relative to the evaluation of applications for nursery schools, and has 9 variables and 12960 cases, without missing values. In all of the cases, missing values are not discarded but treated as a distinct state.

In the first series of experiments, we aim to compare the behavior of our RPDAG-based local search method (RPDAG) with the classical local search in the space of DAGs (DAG). The scoring function selected is BDeu [42] (which is score equivalent and decomposable), with the parameter representing the equivalent sample size set to 1 and a uniform structure prior. The starting point of the search is the empty graph in both cases.

We have collected the following information about the experiments:

BDeu.-
The BDeu score (log version) of the learned network.
Edg.-
The number of edges included in the learned network.
H.-
The Hamming distance, H=A+D+I, i.e. the number of different edges, added (A), deleted (D), or wrongly oriented (without taking into account the differences between equivalent structures) (I), in the learned network with respect to the gold standard network (the original model). This measure is only computed for the three test domains where a gold standard exists.
Iter.-
The number of iterations carried out by the algorithm to reach the best network, i.e. the number of operators used to transform the initial graph into a local optimum.
Ind.-
The number of individuals (either DAGs or RPDAGs) evaluated by the algorithm.
EstEv.-
The number of different statistics evaluated during the execution of the algorithm. This is a useful value to measure the efficiency of the algorithms, because most of the running time of a scoring-based learning algorithm is spent in the evaluation of statistics from the database.
TEst.-
The total number of statistics used by the algorithm. Note that this number can be considerably greater than EstEv. By using hashing techniques we can store and efficiently retrieve any previously calculated statistics. It is not therefore necessary to recompute them by accessing the database, thus gaining in efficiency.
NVars.-
The average number of variables that intervene in the different statistics (i.e. the values $N_{y,Pa_H(y)}$ in Eq. (3)) computed. This value is also important because the time required to compute a statistic increases exponentially with the number of variables involved.
Time.-
The time, measured in seconds, employed by the algorithm to learn the network. Our implementation is written in the JAVA programming language and runs under Linux. This value is only a rough measure of the efficiency of the algorithms, because there are many circumstances that may influence the running time (external loading in a networked computer, caching or any other aspect of the computer architecture, memory paging, use of virtual memory, threading, different code, etc.). Nevertheless, we have tried to ensure that the two algorithms run under the same conditions as far as possible, and the two implementations share most of the code. In fact, the two algorithms have been integrated into the Elvira package (available at http://www.leo.ugr.es/~elvira).

For the Insurance and Hailfinder domains, the reported results are the average values across the five databases considered. The results of our experiments for synthetic data, i.e. Alarm, Insurance and Hailfinder, are displayed in Tables 3, 4 and 5, respectively, where we also show the BDeu values for the true ( $T_{\mbox{{\scriptsize D}}}$) and the empty ( $\emptyset_{\mbox{{\scriptsize D}}}$) networks (with parameters re-trained from the corresponding database $D$), which may serve as a kind of scale. The results obtained for real data are displayed in Table 6.


Table 3: Results for the Alarm databases
  BDeu Edg H A D I Iter Ind EstEv TEst NVar Time
Alarm-3000
RPDAG -33101 46 2 1 1 0 49 63124 3304 123679 2.98 111
DAG -33109 47 7 3 2 2 58 72600 3300 145441 2.88 117
$T_{\mbox{{\scriptsize Alarm3}}}$ -33114 46  
$\emptyset_{\mbox{{\scriptsize Alarm3}}}$ -59890 0  
Alarm-5000
RPDAG -54761 46 2 1 1 0 49 62869 3326 123187 2.97 179
DAG -54956 54 16 9 1 6 60 76212 3391 152663 2.93 194
$T_{\mbox{{\scriptsize Alarm5}}}$ -54774 46  
$\emptyset_{\mbox{{\scriptsize Alarm5}}}$ -99983 0  
Alarm-10000
RPDAG -108432 45 1 0 1 0 48 61190 3264 120049 2.97 346
DAG -108868 52 13 7 1 5 60 75504 3449 151251 2.94 380
$T_{\mbox{{\scriptsize Alarm10}}}$ -108452 46  
$\emptyset_{\mbox{{\scriptsize Alarm10}}}$ -199920 0  



Table 4: Average results for the Insurance domain across 5 databases of size 10000
  BDeu Edg H A D I Iter Ind EstEv TEst NVar Time
RPDAG -133071 45 18 4 10 4 48 36790 1990 69965 2.95 202
DAG -133205 49 25 7 10 8 58 36178 2042 72566 3.03 214
$T_{\mbox{{\scriptsize Insur}}}$ -133040 52  
$\emptyset_{\mbox{{\scriptsize Insur}}}$ -215278 0  



Table 5: Average results for the Hailfinder domain across 5 databases of size 10000
  BDeu Edg H A D I Iter Ind EstEv TEst NVar Time
RPDAG -497872 67 24 12 10 2 68 235374 7490 459436 2.92 847
DAG -498395 75 45 21 13 11 81 240839 7313 482016 2.81 828
$T_{\mbox{{\scriptsize Hail}}}$ -503230 66  
$\emptyset_{\mbox{{\scriptsize Hail}}}$ -697826 0  



Table 6: Results for the UCI databases
  BDeu Edg Iter Ind EstEv TEst NVar Time
Breast-Cancer
RPDAG -2848 6 7 619 151 1232 2.26 0.673
DAG -2848 6 7 619 148 1284 2.26 0.686
crx
RPDAG -5361 19 20 5318 545 10020 2.75 3.03
DAG -5372 19 20 4559 510 9208 2.61 2.85
Flare2
RPDAG -6728 15 16 2637 329 4887 2.71 3.66
DAG -6733 13 14 2012 310 4093 2.45 3.23
House-Votes
RPDAG -4629 22 23 6370 591 11883 2.80 3.10
DAG -4643 23 24 6094 621 12289 2.66 3.39
Mushroom
RPDAG -77239 92 97 43121 2131 78459 3.99 432
DAG -77208 87 103 39944 2173 80175 3.96 449
Nursery
RPDAG -125717 8 9 415 115 803 2.75 11.91
DAG -125717 8 9 611 133 1269 2.59 13.50


As we consider there to be a clear difference between the results obtained for the synthetic and the UCI domains, we shall discuss them separately.

In a second series of experiments, we aim to test the behavior of the search in the RPDAG space when used in combination with a search heuristic which is more powerful than a simple greedy search. The heuristic selected is Tabu Search [40,9], which tries to escape from a local maximum by selecting a solution that minimally decreases the value of the scoring function; immediate re-selection of the local maximum just visited is prevented by maintaining a list of solutions that are forbidden, the so-called tabu list (although for practical reasons the tabu list stores forbidden operators not solutions, and consequently, solutions which have not been visited previously may also become forbidden).

We have implemented two simple versions of tabu search: TS-RPDAG and TS-DAG, which explore the RPDAG and DAG spaces, respectively, using the same operators as their respective greedy versions. The parameters used by these algorithms are the length $tll$ of the tabu list and the number $tsit$ of iterations required to stop the search process. In our experiments, these values have been fixed as follows: $tll=n$ and $tsit=n(n-1)$, $n$ being the number of variables in the domain. The scoring function and the initial graph are the same as in previous experiments, as well as the collected performance measures, with one exception: as the number of iterations is now fixed (Iter=$tsit$), we compute the iteration where the best graph was found (BIter) instead. The results of these experiments are displayed in Tables 7, 8, 9 and 10.


Table 7: Results for the Alarm databases using Tabu Search
  BDeu Edg H A D I BIter Ind EstEv TEst NVar Time
Alarm-3000
TS-RPDAG -33101 46 2 1 1 0 48 1779747 9596 3044392 3.63 286
TS-DAG -33115 51 11 7 2 2 129 1320696 8510 2645091 3.59 391
Alarm-5000
TS-RPDAG -54761 46 2 1 1 0 48 1779579 10471 3031966 3.61 421
TS-DAG -54762 47 3 2 1 0 720 1384990 11113 2773643 3.58 541
Alarm-10000
TS-RPDAG -108432 45 1 0 1 0 47 1764670 10671 3020165 3.66 735
TS-DAG -108442 50 6 5 1 0 284 1385065 11014 2773795 3.60 862



Table 8: Average results for the Insurance domain using Tabu Search
  BDeu Edg H A D I BIter Ind EstEv TEst NVar Time
TS-RPDAG -133070 45 18 4 10 4 58 458973 2823 751551 3.44 182
TS-DAG -132788 47 18 5 10 3 415 352125 5345 706225 4.16 428



Table 9: Average results for the Hailfinder domain using Tabu Search
  BDeu Edg H A D I BIter Ind EstEv TEst NVar Time
TS-RPDAG -497872 67 24 12 10 2 67 9189526 19918 1.5223387E7 4.07 3650
TS-DAG -498073 70 35 17 13 5 1631 7512114 22184 1.5031642E7 4.07 4513



Table 10: Results for the UCI databases using Tabu Search
  BDeu Edg BIter Ind EstEv TEst NVar Time
Breast-Cancer
TS-RPDAG -2848 6 6 8698 345 14209 3.03 1.96
TS-DAG -2848 6 6 6806 316 13892 2.87 1.98
crx
TS-RPDAG -5361 19 19 61175 908 98574 3.20 9.26
TS-DAG -5362 20 29 44507 1176 89714 3.17 12.23
Flare2
TS-RPDAG -6728 15 15 23363 616 37190 3.47 9.70
TS-DAG -6726 15 129 18098 681 36665 3.27 10.60
House-Votes
TS-RPDAG -4622 24 180 73561 1144 121206 3.32 11.14
TS-DAG -4619 23 252 56570 1364 113905 3.29 15.30
Mushroom
TS-RPDAG -77002 99 495 209556 4021 350625 4.83 883
TS-DAG -77073 90 450 157280 3455 315975 4.57 1725
Nursery
TS-RPDAG -125717 8 8 4352 251 6525 3.09 28.05
TS-DAG -125717 8 8 3898 237 7991 2.95 26.20


For the synthetic domains, in all the cases, except in one of the insurance databases, the results obtained by TS-RPDAG and RPDAG are the same. This phenomenon also appears for the UCI databases, where only in two databases does TS-RPDAG improve the results of RPDAG. Therefore, the Tabu Search does not contribute significantly to improving the greedy search in the RPDAG space (at least using the selected values for the parameters $tll$ and $tsit$). This is in contrast with the situation in the DAG space, where TS-DAG improves the results obtained by DAG, with the exception of two UCI databases (equal results) and Alarm-3000 (where DAG performs better than TS-DAG).

With respect to the comparison between TS-RPDAG and TS-DAG, we still consider TS-RPDAG to be preferable to TS-DAG on the synthetic domains, although in this case TS-DAG performs better on the insurance domain. For the UCI databases, the two algorithms perform similarly: each algorithm is better than the other on two domains, and both algorithms perform equally on the remaining two domains. TS-RPDAG is somewhat more efficient than TS-DAG with respect to running time.

We have carried out a third series of experiments to compare our learning algorithm based on RPDAGs with other algorithms for learning Bayesian networks. In this case, the comparison is only intended to measure the quality of the learned network. In addition to the DAG-based local and tabu search previously considered, we have also used the following algorithms:

The two independence-based algorithms, PC and BNPC, operate on the space of equivalence classes, whereas K2 explores the space of DAGs which are compatible with a given ordering. We have included the algorithm K2 in the comparison, using a correct ordering and the same scoring function as the RPDAG and DAG-based search methods, in order to test whether our method can outperform the results obtained with a more informed algorithm. The results for the algorithms PC and K2 have been obtained using our own implementations (which are also included in the Elvira software). For BNPC, we used the software package available at http://www.cs.ualberta.ca/~jcheng/bnsoft.htm.

The test domains included in these experiments are Alarm, Insurance, and Hailfinder. In addition to the BDeu values, the number of edges in the learned networks and the Hamming distances, we have collected two additional performance measures:

BIC.-
The value of the BIC (Bayesian Information Criterion) scoring function [63] for the learned network. This value measures the quality of the network using maximum likelihood and a penalty term. Note that BIC is also score-equivalent and decomposable.
KL.-
The Kullback-Leibler distance (cross-entropy) [49] between the probability distribution, $P$, associated to the database (the empirical frequency distribution) and the probability distribution associated to the learned network, $P_G$. Notice that this measure is actually the same as the log probability of the data. We have in fact calculated a decreasing monotonic linear transformation of the Kullback-Leibler distance, because this one has exponential complexity and the transformation can be computed very efficiently: If $P_{G}$ is the joint probability distribution associated to a network $G$, then the KL distance can be written in the following way [22,50]:
\begin{displaymath}
KL(P,P_G)= -H_P({\cal U})+\sum_{x\in {\cal U}} H_P(x)-\!\!\!...
...cal U},Pa_G(x)\neq\emptyset}
\!\!\!\!\!\!\!\! MI_P(x,Pa_G(x))
\end{displaymath} (5)

where $H_P(Z)$ denotes Shannon entropy with respect to the distribution $P$ for the subset of variables $Z$ and $MI_P(x,Pa_G(x))$ is the measure of mutual information between the two sets of variables $\{x\}$ and $Pa_G(x)$. As the first two terms of the expression above do not depend on the graph $G$, our transformation consists in calculating only the third term in equation (5). So, the interpretation of our transformation of the Kullback-Leibler distance is: the higher this value is, the better the network fits the data. However, this measure should be handled with caution, since a high KL value may also indicate overfitting (a network with many edges will probably have a high KL value).

Although for those algorithms whose goal is to optimize the Bayesian score, BDeu is really the metric that should be used to evaluate them, we have also computed BIC and KL because two of the algorithms considered use independence tests instead of a scoring function.

The results of these experiments are displayed in Table 11. The best value for each performance measure and each database is written in bold, and the second best value in italics. These results indicate that our search method in the RPDAG space, in combination with the BDeu scoring function, is competitive with respect to other algorithms: only the TS-DAG algorithm, which uses a more powerful (and more computationally intensive) search heuristic in the DAG space, and, to a lesser extent, the more informed K2 algorithm, perform better than RPDAG in some cases. Observe that both TS-DAG and K2 perform better than RPDAG in terms of KL on four cases from five.


Table 11: Performance measures for different learning algorithms
  BDeu BIC KL Edg H A D I
Alarm-3000
RPDAG -33101 -33930 9.23055 46 2 1 1 0
DAG -33109 -33939 9.23026 47 7 3 2 2
TS-DAG -33115 -33963 9.23047 51 11 7 2 2
PC -36346 -36691 8.06475 37 10 0 9 1
BNPC -33422 -35197 9.11910 43 7 2 5 0
K2 -33127 -34351 9.23184 46 2 1 1 0
Alarm-5000
RPDAG -54761 -55537 9.25703 46 2 1 1 0
DAG -54956 -55831 9.25632 54 16 9 1 6
TS-DAG -54762 -55540 9.25736 47 3 2 1 0
PC -61496 -61822 7.85435 38 16 2 10 4
BNPC -55111 -55804 9.16787 42 4 0 4 0
K2 -54807 -55985 9.25940 47 3 2 1 0
Alarm-10000
RPDAG -108432 -109165 9.27392 45 1 0 1 0
DAG -108868 -110537 9.27809 52 13 7 1 5
TS-DAG -108442 -109188 9.27439 50 6 5 1 0
PC -117661 -117914 8.31704 38 11 1 9 1
BNPC -109164 -109827 9.18884 42 4 0 4 0
K2 -108513 -109647 9.27549 46 2 1 1 0
Insurance
RPDAG -133071 -134495 8.38502 45 18 4 10 4
DAG -133205 -135037 8.39790 49 25 7 10 8
TS-DAG -132788 -134414 8.41467 47 18 5 10 3
PC -139101 -141214 7.75574 33 23 0 20 3
BNPC -134726 -135832 8.21606 37 26 3 18 5
K2 -132615 -134095 8.42471 44 10 1 9 0
Hailfinder
RPDAG -497872 -531138 20.53164 67 24 12 10 2
DAG -498395 -531608 20.48503 75 45 21 13 11
TS-DAG -498073 -516100 20.59372 70 35 17 13 5
PC -591507 -588638 12.65981 49 50 16 33 1
BNPC -503440 -505160 20.61581 64 28 12 15 1
K2 -498149 -531373 20.51822 67 23 12 11 0


The fourth series of experiments attempts to evaluate the behavior of the same algorithms on a dataset which is different from the training set used to learn the network. In order to do so, we have computed the BDeu, BIC, and KL values of the network structure learned using a database, with respect to a different database: for the Alarm domain, the training set is the Alarm-3000 database used previously, and the test set is formed by the 3000 next cases in the Alarm database; for both Insurance and Hailfinder, we selected one of the five databases that we have been using as the training set and another of these databases as the test set. The results are shown in Table 12. We can observe that they are analogous to the results obtained in Table 11, where the same databases were used for training and testing. However, in this case RPDAG also performs better than TS-DAG and K2 in terms of KL. Therefore, the good behavior of our algorithm cannot be attributed to overfitting.


Table 12: Performance measures for the learning algorithms using a different test set
  BDeu BIC KL
Alarm-3000
RPDAG -32920 -33750 9.35839
DAG -32938 -33769 9.35488
TS-DAG -32947 -33793 9.35465
PC -36286 -36632 8.15227
BNPC -33309 -35051 9.23576
K2 -32951 -34165 9.36168
Insurance
RPDAG -132975 -134471 8.44891
DAG -133004 -134952 8.44745
TS-DAG -132810 -134281 8.44538
PC -140186 -143011 7.70182
BNPC -135029 -136125 8.23063
K2 -132826 -134309 8.44671
Hailfinder
RPDAG -497869 -531165 20.58267
DAG -498585 -531913 20.49730
TS-DAG -497983 -505592 20.55420
PC -587302 -584939 12.68256
BNPC -506680 -508462 20.71069
K2 -498118 -531356 20.57876


Finally, we have carried out another series of experiments, which aim to compare our RPDAG algorithm with the algorithm proposed by [17], that searches in the CPDAG space. In this case, we have selected the House-Votes and Mushroom domains (which were two of the datasets used by [17]). In order to approximate our experimental conditions to those described in Chickering's work, we used the BDeu scoring function with a prior equivalent sample size of ten, and a structure prior of $0.001^f$, where $f$ is the number of free parameters in the DAG; moreover, we used five random subsets of the original databases, each containing approximately 70% of the total data (304 cases for House-Votes and 5686 for Mushroom). Table 13 displays the average values across the five datasets of the relative improvement of the per-case score obtained by RPDAG to the per-case score of DAG, as well as the ratio of the time spent by DAG to the time spent by RPDAG. We also show in Table 13 the corresponding values obtained by Chickering (using only one dataset) for the comparison between his CPDAG algorithm and DAG.


Table 13: Comparison with Chickering's work on completed PDAGs
  RPDAG versus DAG CPDAG versus DAG
Relative Time Relative Time
Dataset Improv. Ratio Improv. Ratio
House-Votes 0.0000 1.041 0.0000 1.27
Mushroom 0.0158 1.005 -0.0382 2.81


We may observe that the behavior of RPDAG and CPDAG is somewhat different: although both algorithms are more efficient than DAG, it seems to us that CPDAG runs faster than RPDAG. With respect to effectiveness, both RPDAG and CPDAG obtain exactly the same solution as DAG in the House-Votes domain (no relative improvement); however, in the other domain, RPDAG outperforms DAG (on the five datasets considered) whereas CPDAG performs worse than DAG. In any case, the differences are small (they could be a result of differences in the experimental setup) and a much more systematic experimentation with these algorithms would be necessary in order to establish general conclusions about their comparative behavior.


next up previous
Next: Concluding Remarks Up: Searching for Bayesian Network Previous: Comparison with other Approaches
Luis Miguel de Campos Ibáñez 2003-05-30