
Learning Bayesian Networks with Non-Decomposable Scores

  • Conference paper
Graph Structures for Knowledge Representation and Reasoning (GKR 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9501)


Abstract

Modern approaches for optimally learning Bayesian network structures require decomposable scores. Such approaches include those based on dynamic programming and heuristic search methods. These approaches operate in a search space called the order graph, which has been investigated extensively in recent years. In this paper, we break from this tradition and show that one can effectively learn structures using non-decomposable scores by exploring a more complex search space that leverages state-of-the-art learning systems based on order graphs. We show how the new search space can be used to learn with priors that are not structure-modular (a particular class of non-decomposable scores). We also show that it can be used to efficiently enumerate the \(k\)-best structures, in time that can be up to three orders of magnitude faster than existing approaches.


Notes

  1.

    At https://sites.google.com/site/bmmalone/files/urlearning.

  2.

    Approaches based on ILP can in principle handle non-decomposable scores (and constraints), assuming that they can be expressed using a linear cost function (or as linear constraints) [25]. We remark that order-modular priors, which we consider later, are not easy to encode as ILPs (as we need to compute linear extension counts).

  3.

    \({ Pr}_0(G_0) = 1\) as there is a unique graph over zero nodes.

  4.

    For each linear extension \(\pi \) of \(G_i\), there are \((i+1)\) places to insert the \((i+1)\)-th node, then \((i+2)\) places to insert the next, and so on. Thus, there are \((i+1)\cdots n = n!/i!\) ways to extend a given ordering over \(i\) variables to \(n\) variables.

  5.

    In particular, every time that we expand a node \(G\), we can increment each of its children’s linear extension counts by \({\mathrm {\#}G}\). Once we have expanded every parent of a child, the child’s linear extension count is correct (a small illustrative sketch follows these notes).

  6.

    There are systems available for (a) finding optimal DAGs using structure-modular priors, (b) for Bayesian model averaging using order-modular priors, and (c) for jointly optimizing over orders and DAGs, using order-modular priors. These tasks are all discussed in [18], which further states that finding optimal DAGs with order-modular priors is a more challenging problem (where we maximize over DAGs, but sum over orders).

  7.

    We remark, however, that [12] is more specifically concerned with the enumeration of the k-shortest paths. Since we are interested in enumerating the k-closest goal nodes, some, but not all, of their theoretical analyses apply to our problem. In particular, each distinct goal node in the BN graph may have many paths that can reach it. Hence, once we obtain one goal node, many more shortest paths may be needed to obtain the next closest (and distinct) goal node. Moreover, we do not need to differentiate between two different paths to the same goal node, as in [12].

  8.

    We remark on another distinction between finding a single optimal DAG, versus enumerating the k-best DAGs. In particular, there are techniques that can guarantee that certain families will not appear in an optimal DAG, which can greatly simplify the learning problem [8, 11, 30, 31]. However, such families may still appear in a k-th best DAG, and hence, these techniques may not be directly applicable.

  9.

    At http://www.cs.iastate.edu/~jtian/Software/UAI-10/KBest.htm.

  10.

    At http://www.cs.york.ac.uk/aig/sw/gobnilp/.

  11.

    Note that GOBNILP is known to be more effective in other regimes, for example, where we can constrain the number of parents that a node can have [21, 34]. However, for our experiments here, we consider the more general case, where we do not assume such a constraint.
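
As an aside on note 5: the following is a minimal, brute-force Python sketch of that bookkeeping (our own illustration; the representation and function names are not from the paper). It expands the BN graph layer by layer, incrementing each child's count by the count \({\mathrm {\#}G}\) of the expanded parent \(G\); once a layer is complete, counts[G] equals the number of paths from the empty DAG to \(G\), i.e., the linear extension count of \(G\). This exhaustive enumeration is feasible only for a handful of variables; the paper's algorithms perform the same bookkeeping during search.

    from itertools import chain, combinations

    def all_subsets(s):
        s = list(s)
        return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

    def linear_extension_counts(variables):
        # A DAG over a subset of variables is represented as a frozenset of
        # (variable, frozenset-of-parents) pairs.
        # counts[G] = number of paths from the empty DAG G_0 to G in the BN
        # graph, which equals the linear extension count #G.
        counts = {frozenset(): 1}        # the unique graph over zero nodes
        layer = [frozenset()]
        for _ in range(len(variables)):
            next_counts = {}
            for G in layer:              # "expanding" the node G
                present = {x for x, _ in G}
                for x in variables:
                    if x in present:
                        continue
                    for U in all_subsets(present):
                        child = G | {(x, frozenset(U))}
                        # increment the child's count by #G (note 5)
                        next_counts[child] = next_counts.get(child, 0) + counts[G]
            counts.update(next_counts)
            layer = list(next_counts)
        return counts

    counts = linear_extension_counts(["X1", "X2", "X3"])
    chain_dag = frozenset({("X1", frozenset()),
                           ("X2", frozenset({"X1"})),
                           ("X3", frozenset({"X2"}))})
    print(counts[chain_dag])   # the chain X1 -> X2 -> X3 has exactly 1 linear extension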

References

  1. Bache, K., Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml

  2. Bouckaert, R.R.: Probabilistic network construction using the minimum description length principle. In: Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU), pp. 41–48 (1993)


  3. Brightwell, G., Winkler, P.: Counting linear extensions. Order 8(3), 225–242 (1991)


  4. Buntine, W.: Theory refinement on Bayesian networks. In: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pp. 52–60 (1991)


  5. Chickering, D., Geiger, D., Heckerman, D.: Learning Bayesian networks: search methods and experimental results. In: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics (AISTATS), pp. 112–128 (1995)


  6. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9(4), 309–347 (1992)


  7. Cussens, J.: Bayesian network learning with cutting planes. In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 153–160 (2011)


  8. Cussens, J.: An upper bound for BDeu local scores. In: European Conference on Artificial Intelligence Workshop on Algorithmic Issues for Inference in Graphical Models (2012)


  9. Cussens, J., Bartlett, M., Jones, E.M., Sheehan, N.A.: Maximum likelihood pedigree reconstruction using integer linear programming. Genet. Epidemiol. 37(1), 69–83 (2013)


  10. Darwiche, A.: Modeling and Reasoning with Bayesian Networks. Cambridge University Press, New York (2009)


  11. De Campos, C.P., Ji, Q.: Efficient structure learning of Bayesian networks using constraints. J. Mach. Learn. Res. 12, 663–689 (2011)


  12. Dechter, R., Flerova, N., Marinescu, R.: Search algorithms for m best solutions for graphical models. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)


  13. Felner, A., Goldenberg, M., Sharon, G., Stern, R., Beja, T., Sturtevant, N.R., Schaeffer, J., Holte, R.: Partial-expansion A* with selective node generation. In: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (2012)


  14. Friedman, N., Koller, D.: Being Bayesian about network structure. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 201–210 (2000)


  15. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)


  16. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)


  17. Jaakkola, T., Sontag, D., Globerson, A., Meila, M.: Learning Bayesian network structure using LP relaxations. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 358–365 (2010)


  18. Koivisto, M., Sood, K.: Exact Bayesian structure discovery in Bayesian networks. J. Mach. Learn. Res. 5, 549–573 (2004)


  19. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Cambridge (2009)


  20. Lam, W., Bacchus, F.: Learning Bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–294 (1994)


  21. Malone, B., Kangas, K., Järvisalo, M., Koivisto, M., Myllymäki, P.: Predicting the hardness of learning Bayesian networks. In: Proceedings of the Twenty-Eighth Conference on Artificial Intelligence (2014)


  22. Malone, B., Yuan, C., Hansen, E.: Memory-efficient dynamic programming for learning optimal Bayesian networks. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (2011)


  23. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)


  24. Niinimäki, T.M., Koivisto, M.: Annealed importance sampling for structure learning in Bayesian networks. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI) (2013)


  25. Oates, C.J., Smith, J.Q., Mukherjee, S., Cussens, J.: Exact estimation of multiple directed acyclic graphs. Stat. Comput. 1–15 (2015). http://link.springer.com/journal/11222/onlineFirst/page/2

  26. Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach. Pearson Education, London (2010)


  27. Silander, T., Myllymäki, P.: A simple approach for finding the globally optimal Bayesian network structure. In: Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 445–452 (2006)


  28. Singh, A.P., Moore, A.W.: Finding optimal Bayesian networks by dynamic programming. Technical report, CMU-CALD-050106 (2005)


  29. Suzuki, J.: A construction of Bayesian networks from databases based on an MDL principle. In: Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 266–273 (1993)


  30. Teyssier, M., Koller, D.: Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In: Proceedings of the Twenty-first Conference on Uncertainty in Artificial Intelligence, pp. 584–590 (2005)


  31. Tian, J.: A branch-and-bound algorithm for MDL learning Bayesian networks. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 580–588 (2000)


  32. Tian, J., He, R., Ram, L.: Bayesian model averaging using the k-best Bayesian network structures. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 589–597 (2010)


  33. Yoshizumi, T., Miura, T., Ishida, T.: A* with partial expansion for large branching factor problems. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI) (2000)


  34. Yuan, C., Malone, B.: Learning optimal Bayesian networks: a shortest path perspective. J. Artif. Intell. Res. 48, 23–65 (2013)


  35. Yuan, C., Malone, B., Wu, X.: Learning optimal Bayesian networks using A* search. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pp. 2186–2191 (2011)


Acknowledgments

We thank Joseph Barker, Zhaoxing Bu, and Ethan Schreiber for helpful comments and discussions. We also thank James Cussens and Brandon Malone for their comments on an earlier version of this paper. This work was supported in part by ONR grant #N00014-12-1-0423 and NSF grant #IIS-1514253.

Author information

Corresponding author

Correspondence to Eunice Yuh-Jie Chen.

A Proofs for Sect. 4.2

Proof

(Theorem 1). The total cost of a path from the root \(G_0\) to a leaf \(G_n\) is:

$$\begin{aligned}&\sum _{G_i \rightarrow G_j} \mathsf {score}(X_j\mathbf{U}_j \mid \mathcal {D}) - \log \frac{{ Pr}_j(G_j)}{{ Pr}_i(G_i)}\\&=\sum _{G_i \rightarrow G_j} \mathsf {score}(X_j\mathbf{U}_j \mid \mathcal {D}) - \log { Pr}_j(G_j) + \log { Pr}_i(G_i)\\&= \mathsf {score}(X_1 \mathbf{U}_1 \mid \mathcal {D}) - \log { Pr}_1(G_1) + \log { Pr}_0(G_0) {} \\&\quad + \mathsf {score}(X_2\mathbf{U}_2 \mid \mathcal {D}) - \log { Pr}_2(G_2) + \log { Pr}_1(G_1) + \ldots {} \\&\quad + \mathsf {score}(X_n\mathbf{U}_n \mid \mathcal {D}) - \log { Pr}_{n}(G_{n}) + \log { Pr}_{n-1}(G_{n-1})\\&= \mathsf {score}(G \mid \mathcal {D}) - \log { Pr}_n(G_n) \end{aligned}$$

as desired.    \(\square \)

Proof

(Theorem 2).

$$\begin{aligned} h(G_i)&= \min _{G_n: G_i \leadsto G_n} \sum _{X\mathbf{U}\in G_n - G_i} \mathsf {score}(X\mathbf{U}\mid \mathcal {D}) + \min _{G_n: G_i \leadsto G_n} - \log \frac{{ Pr}_n(G_n)}{{ Pr}_i(G_i)}\\&\le \min _{G_n: G_i \leadsto G_n} \Big ( \sum _{X\mathbf{U}\in G_n - G_i} \mathsf {score}(X\mathbf{U}\mid \mathcal {D}) - \log \frac{{ Pr}_n(G_n)}{{ Pr}_i(G_i)} \Big )\\&= \min _{G_n: G_i \leadsto G_n} g(G_n) - g(G_i) \end{aligned}$$

Since the heuristic function h lower-bounds the true cost to a goal node, it is admissible.    \(\square \)

Below we consider the correctness of Algorithm 1.

Lemma 1

In Algorithm 1:

  1.

    all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\) is extracted;

  2.

    when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H,

    $$\begin{aligned} f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1}), \end{aligned}$$

    and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), i.e., \(\#G_{i+1}\).

where \(h(G_i) = h_1(G_i) - \sum _{j=i+1}^n \log j\).

Proof

Consider a minimum item \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) extracted from H. Below we show by induction that (1) all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\), and (2) \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and \(l_{i + 1}\) is the number of linear extensions of \(G_{i + 1}\), which is also the number of paths from \(G_0\) to \(G_{i + 1}\).

For \(i = 0\), clearly (1) and (2) are true. Assume (1) and (2) are true for all \((G_i, f_i, l_i)\). Then when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted, \(l_{i + 1}\) is the number of paths from \(G_0\) to \(G_{i + 1}\) that pass through the \(G_i\) extracted from H before \(G_{i + 1}\). To see this, note that each \(l_i\) is the number of paths from \(G_0\) to \(G_i\). Moreover, since \(l_{i + 1}\) is the number of paths from \(G_0\) to \(G_{i + 1}\) that pass through the \(G_i\) extracted from H before \(G_{i + 1}\), when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is in H,

$$\begin{aligned} f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \sum _{G_i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1}) + h (G_{i + 1}), \end{aligned}$$

where \(G_i \prec G_{i + 1}\) denotes that \(G_i\) is extracted before \(G_{i + 1}\), and \(N(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1})\) denotes the number of paths \(G_0 \rightarrow \ldots \rightarrow G_i \rightarrow G_{i + 1}\). Note that \(f_{i + 1}\) decreases as more \(G_i\) are extracted.

Consider when \((G_i, f_i, l_i)\) and \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) are both in H. Below we show that all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\). Consider

$$\begin{aligned} f_i&= \mathsf {score}(G_i | \mathcal {D}) \\&\quad - \log \sum _{G_{i - 1} \prec G_i} N(G_0 \rightarrow \ldots \rightarrow G_{i - 1} \rightarrow G_i) + h_1(G_i) - \sum _{j = i + 1}^n \log j\\ f_{i + 1}&= \mathsf {score}(G_{i + 1} | \mathcal {D}) \\&\quad - \log \sum _{G^\prime _i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G^\prime _i \rightarrow G_{i + 1})+ h_1 (G_{i + 1}) - \sum _{j = i + 2}^n \log j\\ \end{aligned}$$

We simply need to show \(f_{i + 1} > f_i\). Since \(h_1\) is a consistent heuristic function for learning with \(\mathsf {score}\), \(\mathsf {score}(G_{i + 1} | \mathcal {D}) + h_1(G_{i + 1}) \ge \mathsf {score}(G_i | \mathcal {D}) + h_1 (G_i)\). Then we only need to show

$$\begin{aligned}&(i + 1) \sum _{G_{i - 1} \prec G_i} N(G_0 \rightarrow \ldots \rightarrow G_{i - 1} \rightarrow G_i) \\&> \sum _{G^\prime _i \prec G_{i + 1}} N(G_0 \rightarrow \ldots \rightarrow G^\prime _i \rightarrow G_{i + 1}) \end{aligned}$$

First, for any pair of DAGs \(G_i\) and \(G^\prime _i\) that can generate a DAG \(G_{i+1}\), there exists a unique DAG \(G_{i-1}\) that can generate both \(G_i\) and \(G^\prime _i\). For each \(G^\prime _i\) on the right-hand side, there thus exists a corresponding (and unique) \(G_{i-1}\) on the left-hand side that can generate both \(G^\prime _i\) and \(G_i\). Further, since \(G^\prime _i\) was expanded, \(G_{i-1}\) must also have been expanded (by induction). For each such \(G_{i-1}\), if \(G^\prime _i\) has a linear extension count of \(L\), then \(G_{i-1}\) must have a linear extension count of at least \(L/i\), and hence the corresponding \(N(G_0 \rightarrow \ldots \rightarrow G_{i-1} \rightarrow G_i)\) is at least \(L/i\). The corresponding term on the left-hand side is thus at least \((i+1)\cdot L/i > L\). Since this holds for each element of the summation on the right-hand side, the above inequality holds.

Since all \(G_i\) that generate \(G_{i + 1}\) are extracted from H before \(G_{i + 1}\), it follows that \(f_{i + 1} = \mathsf {score}(G_{i + 1} | \mathcal {D}) - \log \#G_{i + 1} + h(G_{i + 1})\), and \(l_{i + 1}\) is the number of all paths from \(G_0\) to \(G_{i + 1}\).
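
To make the bookkeeping of Lemma 1 concrete, here is a minimal Python sketch of an A*-style search over the BN graph (our own illustration, not the paper's Algorithm 1). The helpers children, family_score, heuristic, and is_goal are hypothetical placeholders: children(G) generates the BN-graph children of G, family_score(G, child) returns \(\mathsf {score}(X\mathbf{U} \mid \mathcal {D})\) for the newly added family, heuristic plays the role of \(h\) in Lemma 1, and is_goal tests whether G is a DAG over all n variables. Each node's l value accumulates path counts from already-extracted parents, its f value is \(g - \log l + h\), and stale heap entries are skipped lazily in place of a decrease-key operation.

    import heapq
    from math import log

    def astar_over_bn_graph(root, children, family_score, heuristic, is_goal):
        # g[G]: sum of family scores of G (path-independent in the BN graph)
        # l[G]: number of paths from the root through already-extracted parents
        # f(G) = g[G] - log(l[G]) + heuristic(G), as in Lemma 1
        g = {root: 0.0}
        l = {root: 1}
        closed = set()
        heap = [(heuristic(root), 0, root)]
        counter = 1                       # tie-breaker; avoids comparing DAGs
        while heap:
            f, _, G = heapq.heappop(heap)
            if G in closed:
                continue
            if f > g[G] - log(l[G]) + heuristic(G) + 1e-12:
                continue                  # stale entry; a fresher one was pushed later
            closed.add(G)
            if is_goal(G):                # first goal popped is best; keep popping
                return G, f               # the heap to enumerate the k best
            for child in children(G):
                # every path to `child` has the same sum of family scores
                g[child] = g[G] + family_score(G, child)
                l[child] = l.get(child, 0) + l[G]   # note 5 / Lemma 1 bookkeeping
                f_child = g[child] - log(l[child]) + heuristic(child)
                heapq.heappush(heap, (f_child, counter, child))
                counter += 1
        return None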

Proof

(of Theorem 4). To see the correctness of the algorithm, simply note that by Lemma 1, when \((G_{i + 1}, f_{i + 1}, l_{i + 1})\) is extracted from H, i.e., the open list, \(f_{i + 1} = f(G_{i + 1})\).

Proof

(of Theorem 3). By Lemma 1, Algorithm 1 can count the number of linear extensions.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chen, E.Y.-J., Choi, A., Darwiche, A. (2015). Learning Bayesian Networks with Non-Decomposable Scores. In: Croitoru, M., Marquis, P., Rudolph, S., Stapleton, G. (eds) Graph Structures for Knowledge Representation and Reasoning. GKR 2015. Lecture Notes in Computer Science, vol 9501. Springer, Cham. https://doi.org/10.1007/978-3-319-28702-7_4


  • DOI: https://doi.org/10.1007/978-3-319-28702-7_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28701-0

  • Online ISBN: 978-3-319-28702-7

