Abstract
Machine learning (ML) methods, both unsupervised and supervised, have been applied to several tasks in materials science, such as property prediction, the design of new chemical compounds, and surrogate modeling in molecular dynamics simulations. ML methods can also play a fundamental role in materials screening by reducing the number of compounds under scrutiny. This reduction assumes that compounds with similar representations under a given descriptor tend to have similar properties; thus, an unsupervised ML method, such as the k-means algorithm, can cluster the data set and deliver a set of representative samples. However, this selection depends on the molecular representation, which might not relate directly to the target property. Here, we propose a framework that lets the specialist select a set of representative samples in a guided fashion. In particular, we implement a loop between a clustering algorithm (k-means) and an optimization method (Basin-Hopping), which allows the system to learn feature weights that form clusters more homogeneous with respect to the target property. The framework also offers visual and textual functionalities to support the expert. We evaluate the proposed framework in two scenarios, and the results show that the guidance improves cluster formation in both coarse (few large clusters) and fine (many small clusters) analyses.
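The loop between k-means and Basin-Hopping described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the objective (within-cluster sum of squared deviations of the target property), the coordinate-wise local search, the perturbation size, the temperature, and the toy data set in which only the first feature relates to the target. The function names (`kmeans`, `heterogeneity`, `local_search`, `basin_hopping`) are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic start; returns one label per point."""
    step = max(1, len(points) // k)
    centers = [list(points[i * step]) for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def heterogeneity(weights, X, y, k):
    """Assumed objective: spread of the target property inside each cluster."""
    scaled = [[w * v for w, v in zip(weights, row)] for row in X]
    labels = kmeans(scaled, k)
    total = 0.0
    for c in range(k):
        vals = [t for t, lab in zip(y, labels) if lab == c]
        if vals:
            mean = sum(vals) / len(vals)
            total += sum((t - mean) ** 2 for t in vals)
    return total

def local_search(weights, score_fn, step=0.25, min_step=0.02):
    """Coordinate-wise descent on the weights; halves the step when stuck."""
    best, best_score = list(weights), score_fn(weights)
    while step > min_step:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] = max(0.0, trial[i] + delta)  # weights stay non-negative
                trial_score = score_fn(trial)
                if trial_score < best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            step /= 2
    return best, best_score

def basin_hopping(X, y, k, n_hops=20, temperature=1.0, seed=0):
    """Perturb the weights, minimize locally, accept with a Metropolis test."""
    rng = random.Random(seed)
    score_fn = lambda w: heterogeneity(w, X, y, k)
    current, current_score = local_search([1.0] * len(X[0]), score_fn)
    best, best_score = current, current_score
    for _ in range(n_hops):
        jumped = [max(0.0, w + rng.uniform(-0.5, 0.5)) for w in current]
        cand, cand_score = local_search(jumped, score_fn)
        if cand_score < best_score:
            best, best_score = cand, cand_score
        worse_by = cand_score - current_score
        if worse_by <= 0 or rng.random() < math.exp(-worse_by / temperature):
            current, current_score = cand, cand_score
    return best, best_score

# Toy data (an assumption): feature 0 tracks the target, feature 1 is wide noise.
rng = random.Random(42)
X, y = [], []
for i in range(30):
    group = i % 3
    X.append([group + rng.gauss(0, 0.05), rng.uniform(0, 10)])
    y.append(10.0 * group + rng.gauss(0, 0.1))

baseline = heterogeneity([1.0, 1.0], X, y, 3)  # unweighted k-means
weights, score = basin_hopping(X, y, 3)        # guided feature weights
print("baseline:", baseline, "guided:", score, "weights:", weights)
```

Because the search starts from uniform weights and only ever records improvements, the guided score can never be worse than the unweighted baseline under this objective. In practice, library routines such as a k-means implementation with multiple restarts and a general-purpose Basin-Hopping optimizer would likely replace the hand-written helpers above.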
Acknowledgements
The authors gratefully acknowledge support from FAPESP (São Paulo Research Foundation) and Shell, projects No. 2017/11631-2, 2018/21401-7, and 2022/09285-7, and the strategic importance of the support given by ANP (Brazil’s National Oil, Natural Gas and Biofuels Agency) through the R&D levy regulation. The authors also thank the Department of Information Technology - Campus São Carlos for the infrastructure provided to our computer cluster.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Calderan, F.V., de Mendonça, J.P.A., Silva, J.L.F.D., Quiles, M.G. (2023). Guided Clustering for Selecting Representatives Samples in Chemical Databases. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14111. Springer, Cham. https://doi.org/10.1007/978-3-031-37126-4_10
DOI: https://doi.org/10.1007/978-3-031-37126-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37125-7
Online ISBN: 978-3-031-37126-4
eBook Packages: Computer Science, Computer Science (R0)