Abstract
Modern approaches for optimally learning Bayesian network structures require decomposable scores. Such approaches include those based on dynamic programming and heuristic search methods. These approaches operate in a search space called the order graph, which has been investigated extensively in recent years. In this paper, we break from this tradition and show that one can effectively learn structures using non-decomposable scores by exploring a more complex search space that leverages state-of-the-art learning systems based on order graphs. We show how the new search space can be used to learn with priors that are not structure-modular (a particular class of non-decomposable scores). We also show that it can be used to efficiently enumerate the \(k\)-best structures, in time that can be up to three orders of magnitude faster than existing approaches.
Notes
- 1.
- 2.
Approaches based on ILP can in principle handle non-decomposable scores (and constraints), assuming that they can be expressed using a linear cost function (or as linear constraints) [25]. We remark that order-modular priors, which we consider later, are not easy to encode as ILPs (as we need to compute linear extension counts).
- 3.
\({ Pr}_0(G_0) = 1\) as there is a unique graph over zero nodes.
- 4.
For each linear extension \(\pi \) of \(G_i\), there are \((i+1)\) places to insert the \((i+1)\)-th node, then \((i+2)\) places to insert the next, and so on. Thus, there are \((i+1)\cdots n\) ways to extend a given ordering over \(i\) variables to \(n\) variables.
- 5.
In particular, every time we expand a node \(G\), we can increment each of its children’s linear extension counts by \({\mathrm {\#}G}\). Once we have expanded every parent of a child, the child’s linear extension count is correct (a minimal counting sketch appears after these notes).
- 6.
There are systems available for (a) finding optimal DAGs using structure-modular priors, (b) Bayesian model averaging using order-modular priors, and (c) jointly optimizing over orders and DAGs using order-modular priors. These tasks are all discussed in [18], which further states that finding optimal DAGs with order-modular priors is a more challenging problem (where we maximize over DAGs, but sum over orders).
- 7.
We remark, however, that [12] is more specifically concerned with the enumeration of the k-shortest paths. Since we are interested in enumerating the k-closest goal nodes, some, but not all, of their theoretical analysis applies to our problem. In particular, each distinct goal node in the BN graph may have many paths that reach it. Hence, once we obtain one goal node, many more shortest paths may be needed to obtain the next closest (and distinct) goal node. Moreover, we do not need to differentiate between two different paths to the same goal node, as in [12].
- 8.
We remark on another distinction between finding a single optimal DAG and enumerating the k-best DAGs. In particular, there are techniques that can guarantee that certain families will not appear in an optimal DAG, which can greatly simplify the learning problem [8, 11, 30, 31]. However, such families may still appear in a k-th best DAG, and hence these techniques may not be directly applicable.
- 9.
- 10.
- 11.
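To make the counting in Notes 4 and 5 concrete, below is a minimal Python sketch (ours, for illustration only; the function name and input format are not from the paper) that computes the linear extension count \(\#G\) of a fixed DAG by dynamic programming over subsets of its nodes. It is not the incremental scheme of Note 5, which instead accumulates these counts as a by-product of expanding the BN graph (a child’s count is the sum of the counts of its expanded parents), but it computes the same quantity.

```python
from functools import lru_cache

def count_linear_extensions(parents):
    """Count the linear extensions (topological orderings) of a DAG.

    `parents` is a hypothetical input format: a dict mapping each node
    to the set of its parents, e.g. {0: set(), 1: {0}, 2: {0}} for the
    DAG with edges 0 -> 1 and 0 -> 2.  The returned value corresponds
    to the quantity #G used in the paper.
    """
    nodes = frozenset(parents)

    @lru_cache(maxsize=None)
    def count(placed):
        # Number of orderings of the node set `placed` in which every
        # node appears after all of its parents.
        if not placed:
            return 1
        total = 0
        for v in placed:
            # v may be placed last among `placed` iff all of its
            # parents are placed before it.
            if parents[v] <= placed - {v}:
                total += count(placed - frozenset([v]))
        return total

    return count(nodes)

# Example: the v-structure 0 -> 2 <- 1 has two linear extensions,
# namely 0,1,2 and 1,0,2.
print(count_linear_extensions({0: set(), 1: set(), 2: {0, 1}}))  # 2
```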
References
Bache, K., Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml
Bouckaert, R.R.: Probabilistic network construction using the minimum description length principle. In: Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU), pp. 41–48 (1993)
Brightwell, G., Winkler, P.: Counting linear extensions. Order 8(3), 225–242 (1991)
Buntine, W.: Theory refinement on Bayesian networks. In: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pp. 52–60 (1991)
Chickering, D., Geiger, D., Heckerman, D.: Learning Bayesian networks: search methods and experimental results. In: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics (AISTATS), pp. 112–128 (1995)
Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9(4), 309–347 (1992)
Cussens, J.: Bayesian network learning with cutting planes. In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 153–160 (2011)
Cussens, J.: An upper bound for BDeu local scores. In: European Conference on Artificial Intelligence Workshop on Algorithmic Issues for Inference in Graphical Models (2012)
Cussens, J., Bartlett, M., Jones, E.M., Sheehan, N.A.: Maximum likelihood pedigree reconstruction using integer linear programming. Genet. Epidemiol. 37(1), 69–83 (2013)
Darwiche, A.: Modeling and Reasoning with Bayesian Networks. Cambridge University Press, New York (2009)
De Campos, C.P., Ji, Q.: Efficient structure learning of Bayesian networks using constraints. J. Mach. Learn. Res. 12, 663–689 (2011)
Dechter, R., Flerova, N., Marinescu, R.: Search algorithms for m best solutions for graphical models. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)
Felner, A., Goldenberg, M., Sharon, G., Stern, R., Beja, T., Sturtevant, N.R., Schaeffer, J., Holte, R.: Partial-expansion A* with selective node generation. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)
Friedman, N., Koller, D.: Being Bayesian about network structure. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 201–210 (2000)
Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
Jaakkola, T., Sontag, D., Globerson, A., Meila, M.: Learning Bayesian network structure using LP relaxations. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 358–365 (2010)
Koivisto, M., Sood, K.: Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 5, 549–573 (2004)
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge (2009)
Lam, W., Bacchus, F.: Learning Bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–294 (1994)
Malone, B., Kangas, K., Järvisalo, M., Koivisto, M., Myllymäki, P.: Predicting the hardness of learning Bayesian networks. In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence (2014)
Malone, B., Yuan, C., Hansen, E.: Memory-efficient dynamic programming for learning optimal Bayesian networks. In: Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (2011)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Niinimäki, T.M., Koivisto, M.: Annealed importance sampling for structure learning in Bayesian networks. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (2013)
Oates, C.J., Smith, J.Q., Mukherjee, S., Cussens, J.: Exact estimation of multiple directed acyclic graphs. Stat. Comput. 1–15 (2015)
Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach. Pearson Education, London (2010)
Silander, T., Myllymäki, P.: A simple approach for finding the globally optimal Bayesian network structure. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 445–452 (2006)
Singh, A.P., Moore, A.W.: Finding optimal Bayesian networks by dynamic programming. Technical report, CMU-CALD-050106 (2005)
Suzuki, J.: A construction of Bayesian networks from databases based on an MDL principle. In: Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 266–273 (1993)
Teyssier, M., Koller, D.: Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In: Proceedings of the Twenty-first Conference on Uncertainty in Artificial Intelligence, pp. 584–590 (2005)
Tian, J.: A branch-and-bound algorithm for MDL learning Bayesian networks. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 580–588 (2000)
Tian, J., He, R., Ram, L.: Bayesian model averaging using the k-best Bayesian network structures. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 589–597 (2010)
Yoshizumi, T., Miura, T., Ishida, T.: A* with partial expansion for large branching factor problems. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (2000)
Yuan, C., Malone, B.: Learning optimal Bayesian networks: a shortest path perspective. J. Artif. Intell. Res. 48, 23–65 (2013)
Yuan, C., Malone, B., Wu, X.: Learning optimal Bayesian networks using A* search. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pp. 2186–2191 (2011)
Acknowledgments
We thank Joseph Barker, Zhaoxing Bu, and Ethan Schreiber for helpful comments and discussions. We also thank James Cussens and Brandon Malone for their comments on an earlier version of this paper. This work was supported in part by ONR grant #N00014-12-1-0423 and NSF grant #IIS-1514253.
A Proofs for Sect. 4.2
Proof
(Theorem 1 ). The total cost of a path from the root \(G_0\) to a leaf \(G_n\) is:
as desired. \(\square \)
Proof
(Theorem 2). Since the heuristic function \(h\) lower-bounds the true cost of reaching a goal node, it is admissible. \(\square \)
Below we consider the correctness of Algorithm 1.
Lemma 1
In Algorithm 1:
1. all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\) is extracted;
2. when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H,
$$\begin{aligned} f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1}), \end{aligned}$$
and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), i.e., \(\#G_{i+1}\),
where \(h(G_i) = h_1(G_i) - \sum _{j=i+1}^n \log j\).
Proof
Consider a minimum item \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) extracted from H. Below we show by induction that (1) all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\), and (2) \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), which is also the number of paths from \(G_0\) to \(G_{i + 1}\).
For \(i = 0\), (1) and (2) clearly hold. Assume (1) and (2) hold for all \((G_i, f_i, l_i)\). Then when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted, \(l_{i + 1}\) is the number of paths from \(G_0\) to \(G_{i + 1}\) that pass through the \(G_i\) extracted from H before \(G_{i + 1}\). To see this, note that \(l_i\) is the number of paths from \(G_0\) to \(G_i\). Moreover, since \(l_{i + 1}\) is the number of paths from \(G_0\) to \(G_{i + 1}\) that pass through the \(G_i\) extracted from H before \(G_{i + 1}\), when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is in H,
where \(G_i \prec G_{i + 1}\) denotes that \(G_i\) is extracted before \(G_{i + 1}\), and \(N(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1})\) denotes the number of paths \(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1}\). Note that \(f_{i + 1}\) decreases as more \(G_i\) are extracted.
Consider when \((G_i, f_i, l_i)\) and \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) are both in H. Below we show that all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\). Consider
We simply need to show \(f_{i + 1} > f_i\). Since \(h_1\) is a consistent heuristic function for learning with \(\mathsf {score}\), \(\mathsf {score}(G_{i + 1} | \mathcal {D}) + h_1(G_{i + 1}) \ge \mathsf {score}(G_i | \mathcal {D}) + h_1 (G_i)\). Then we only need to show
First, for any pair of DAGs \(G_i\) and \(G^\prime _i\) that can generate a DAG \(G_{i+1}\), there exists a unique DAG \(G_{i-1}\) that can generate both \(G_i\) and \(G^\prime _i\). For each \(G^\prime _i\) on the right-hand side, there thus exists a corresponding (and unique) \(G_{i-1}\) on the left-hand side that can generate both \(G^\prime _i\) and \(G_i\). Further, since \(G^\prime _i\) was expanded, \(G_{i-1}\) must also have been expanded (by induction). For each such \(G_{i-1}\), if \(G^\prime _i\) has a linear extension count of \(L\), then \(G_{i-1}\) must have a linear extension count of at least \(L/i\), and hence the corresponding \(N(G_0 \rightarrow \ldots \rightarrow G_{i-1} \rightarrow G_i)\) is at least \(L/i\). The corresponding term on the left-hand side is thus at least \((i+1)\cdot L/i > L\). Since this holds for each element of the summation on the right-hand side, the above inequality holds.
Since all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\), it follows that \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and \(l_{i + 1}\) is the number of all paths from \(G_0\) to \(G_{i + 1}\).
Proof
(of Theorem 4). To see the correctness of the algorithm, simply note that by Lemma 1, when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H, i.e., the open list, \(f_{i + 1} = f(G_{i + 1})\).
Proof
(of Theorem 3). By Lemma 1, Algorithm 1 correctly counts the number of linear extensions.
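To illustrate the node cost established in Lemma 1 and used in the proof of Theorem 4, the following minimal Python sketch (ours, for illustration; the helper name, argument names, and numeric values are not from the paper, and all inputs are assumed to be supplied by the learning system) evaluates \(f(G_i) = \mathsf {score}(G_i | \mathcal {D}) - \log \#G_i + h(G_i)\) with \(h(G_i) = h_1(G_i) - \sum _{j=i+1}^n \log j\).

```python
import math

def f_cost(score_Gi, ext_count_Gi, h1_Gi, i, n):
    """Hypothetical helper: f-cost of a BN-graph node G_i over i of n variables.

    score_Gi     -- score(G_i | D), the (decomposable) score of G_i
    ext_count_Gi -- #G_i, the number of linear extensions of G_i
    h1_Gi        -- h1(G_i), a consistent heuristic for the score alone
    """
    # h(G_i) = h1(G_i) - sum_{j=i+1}^{n} log j, as in Lemma 1.
    h_Gi = h1_Gi - sum(math.log(j) for j in range(i + 1, n + 1))
    # f(G_i) = score(G_i | D) - log #G_i + h(G_i).
    return score_Gi - math.log(ext_count_Gi) + h_Gi

# Usage example with made-up numbers: a node over i = 2 of n = 4
# variables with 2 linear extensions.
print(f_cost(score_Gi=12.7, ext_count_Gi=2, h1_Gi=20.1, i=2, n=4))
```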