Guided Clustering for Selecting Representatives Samples in Chemical Databases

  • Conference paper
Computational Science and Its Applications – ICCSA 2023 Workshops (ICCSA 2023)

Abstract

Machine Learning (ML) methods, both unsupervised and supervised, have been applied to several tasks in Materials Science, such as property prediction, the design of new chemical compounds, and surrogate models for molecular dynamics simulations. ML methods can also play a fundamental role in materials screening by reducing the number of compounds under scrutiny. This reduction assumes that compounds with similar representations under a given descriptor tend to have similar properties; thus, an unsupervised ML method, such as the k-means algorithm, can cluster the data set and deliver a set of representative samples. However, this selection depends on the molecular representation, which might not relate directly to the target property. Here, we propose a framework that lets the specialist select a set of representative samples in a guided fashion. In particular, we implement a loop between a clustering algorithm (k-means) and an optimization method (Basin-Hopping), which allows the system to learn feature weights that yield clusters more homogeneous with respect to the target property. The framework also offers visual and textual functionalities to support the expert. We evaluate the proposed framework in two scenarios, and the results show that the guidance improves cluster formation in both coarse (few large clusters) and fine (many small clusters) analyses.
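The loop between k-means and Basin-Hopping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the objective (size-weighted within-cluster variance of the target property), the synthetic two-feature data set, the small NumPy k-means with a fixed start, the Nelder-Mead local minimizer, and all function names (`kmeans`, `heterogeneity`, `representatives`). Only `scipy.optimize.basinhopping` is a real library routine.

```python
import numpy as np
from scipy.optimize import basinhopping


def kmeans(X, k, n_iter=30):
    """Plain Lloyd k-means with a deterministic start (the first k rows)."""
    centers = X[:k].astype(float).copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centers[j] = members.mean(axis=0)
    return labels, centers


def heterogeneity(weights, X, y, k):
    """Assumed objective: size-weighted within-cluster variance of y."""
    labels, _ = kmeans(X * np.abs(weights), k)
    return sum(y[labels == j].var() * np.sum(labels == j)
               for j in range(k) if np.any(labels == j)) / len(y)


def representatives(weights, X, k):
    """Index of the sample closest to each centroid in the weighted space."""
    Xw = X * np.abs(weights)
    _, centers = kmeans(Xw, k)
    return [int(np.linalg.norm(Xw - c, axis=1).argmin()) for c in centers]


# Synthetic data (hypothetical): feature 0 drives the target property,
# feature 1 is high-variance noise that dominates unweighted distances.
rng = np.random.default_rng(0)
n = 60
informative = np.r_[rng.normal(0.0, 0.3, n // 2), rng.normal(3.0, 0.3, n // 2)]
noise = rng.normal(0.0, 5.0, n)
X = np.column_stack([informative, noise])
y = 2.0 * informative + rng.normal(0.0, 0.1, n)

k = 2
w0 = np.ones(X.shape[1])  # unweighted clustering as the starting point
baseline = heterogeneity(w0, X, y, k)

# Each objective evaluation re-runs k-means with the candidate weights.
# The objective is piecewise constant, so a derivative-free local
# minimizer (Nelder-Mead) is used inside Basin-Hopping.
result = basinhopping(
    heterogeneity, w0, niter=10, stepsize=1.0,
    minimizer_kwargs={"method": "Nelder-Mead", "args": (X, y, k),
                      "options": {"maxiter": 100}},
)
guided = result.fun
reps = representatives(result.x, X, k)

print("baseline heterogeneity:", baseline)
print("guided heterogeneity:  ", guided)
print("learned weights:", np.abs(result.x), "representatives:", reps)
```

Because Basin-Hopping takes random steps and no seed is fixed here, the learned weights can differ between runs; the guided objective should never be worse than the unweighted baseline, since the search starts from the all-ones weights.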

Acknowledgements

The authors gratefully acknowledge support from FAPESP (São Paulo Research Foundation) and Shell, projects No. 2017/11631-2, 2018/21401-7, and 2022/09285-7, and the strategic importance of the support given by ANP (Brazil’s National Oil, Natural Gas and Biofuels Agency) through the R&D levy regulation. The authors also thank the Department of Information Technology - Campus São Carlos for the infrastructure provided to our computer cluster.

Corresponding author

Correspondence to Marcos G. Quiles.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Calderan, F.V., de Mendonça, J.P.A., Silva, J.L.F.D., Quiles, M.G. (2023). Guided Clustering for Selecting Representatives Samples in Chemical Databases. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14111. Springer, Cham. https://doi.org/10.1007/978-3-031-37126-4_10

  • DOI: https://doi.org/10.1007/978-3-031-37126-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37125-7

  • Online ISBN: 978-3-031-37126-4

  • eBook Packages: Computer Science (R0)
