Abstract
This study explores the impact of pretraining Graph Transformers using atom-level quantum-mechanical features for molecular property modeling. We use the ADMET datasets from the Therapeutics Data Commons to evaluate the benefits of this approach. Our results show that pretraining on quantum atomic properties improves the performance of the Graphormer model. We compare this approach with two other pretraining strategies: one based on a molecular quantum property (the HOMO-LUMO gap) and another using a self-supervised atom masking technique. Additionally, we employ a spectral analysis of Attention Rollout matrices to understand the underlying reasons for these performance enhancements. Our findings suggest that models pretrained on atom-level quantum mechanics are better at capturing low-frequency Laplacian eigenmodes from the molecular graphs, which correlates with improved outcomes on most evaluated downstream tasks, as measured by our custom metric.
1 Introduction
In recent years, the application of deep learning techniques has brought about a paradigm shift in molecular representation learning, playing a pivotal role in a wide array of biochemical endeavors including property modeling and drug design [3, 5–7, 18, 20]. Leveraging deep learning methodologies has enabled researchers to extract intricate features from molecular data, thereby enhancing our understanding of molecular structures and their interactions. However, despite the remarkable successes achieved, challenges such as data scarcity and generalizability remain pertinent concerns in the field [3, 4, 8, 10, 12]. To address these challenges, the concept of pretraining models on related tasks or employing self-supervised learning strategies has gained significant traction. Pretraining serves as a means to provide models with a foundational understanding of molecular structures, enabling them to learn meaningful representations even in the presence of limited or noisy labeled data. By leveraging pretraining techniques, researchers aim to enhance model generalizability and performance across a spectrum of downstream tasks [15, 19, 25–27].
In this context, our study investigates the impact of pretraining on atom-level quantum mechanical (QM) properties. These properties are associated with fundamental aspects of molecular behavior, have profound implications in biochemical research [2], and are present in an increasing number of public datasets [14, 17, 21]. The pretraining is implemented on the Graphormer neural network [28], an instance of the increasingly popular family of Graph Transformer (GT) architectures [22]. Specifically, we compare the efficacy of such pretraining with two alternative strategies: pretraining on a molecular quantum property (the HOMO-LUMO gap) and masking, an atom-level self-supervised pretraining method. As downstream tasks relevant to applications in the pharmaceutical industry, we use the ADMET group of datasets from the Therapeutics Data Commons (TDC) [16]. For each pretraining technique and downstream property, we compare the model performance with a spectral analysis of the Attention Rollout matrix, to understand approximately which factors contribute to the model's behavior. This analysis reveals that models pretrained with atom-level quantum properties and with masking extract graph spectral properties in the form of Laplacian eigenvectors. Moreover, we observe that models pretrained with atom-level quantum properties extract more low-frequency Laplacian eigenmodes from the input graph signal, and we demonstrate how this effect correlates with improved performance on a large share of the downstream tasks.
2 Methods
We consider a custom implementation of Graphormer [24, 28] as an instance of a network that belongs to the category of GTs. As a baseline we employ the non-pretrained Graphormer, which is compared with 7 pretrained models for a total of 8 different cases: one for each of the 4 atom-resolved QM properties (atomic charges, NMR shielding constants, electrophilic and nucleophilic Fukui function indexes), one considering all atomic properties in a multi-task setting, one for the considered molecular property (HOMO-LUMO gap), and one for masked node pretraining. A spectral analysis of the Attention Rollout matrix \(\tilde{A}\) is then performed to gain insights into the behaviour of each obtained model.
Model Description. Graphormer is a GT in which the input molecule is treated as a graph where atoms are nodes and bonds are edges. In general, the model encodes the atoms of the molecule, tokenized by atom type, and then repeatedly applies self-attention layers with an internal bias term added before the softmax. This term is based on the topological distance matrix of the molecular graph and makes it possible to encode the structural information of the graph. In particular, the network employed in this work is the implementation of Graphormer from [24], inspired by the implementation from [28]. In this implementation the centrality encoder is adapted from using only explicit neighbours to including both explicit atoms and implicit hydrogens. Combining this modified centrality encoding with the usual atom type encoder handles the hybridization of atoms implicitly. For this reason, this implementation has no edge encoder component.
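The bias mechanism described above can be sketched for a single attention head as follows. This is a minimal NumPy illustration and not the implementation from [24] or [28]: the function names, the one-scalar-per-distance `bias_table`, and the clipping of large distances are assumptions made for the example.

```python
import numpy as np

def shortest_path_distances(adj):
    """All-pairs topological distances (Floyd-Warshall) from a 0/1 adjacency matrix."""
    n = adj.shape[0]
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # Relax every pair (i, j) through the intermediate node k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def biased_attention(x, w_q, w_k, w_v, dist, bias_table):
    """Single-head self-attention with a distance-indexed bias added before the softmax.

    bias_table is a hypothetical 1D array holding one learnable scalar per distance.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Distances beyond the table size share the last entry (an assumption of this sketch).
    idx = np.minimum(dist, len(bias_table) - 1).astype(int)
    scores = scores + bias_table[idx]
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

For a three-atom chain, `shortest_path_distances` gives a distance of 2 between the terminal atoms, so their attention score receives `bias_table[2]` before normalization.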
Datasets. For pretraining, we used a publicly available dataset [13] consisting of 136k organic molecules and containing, among other things, atomic properties calculated with quantum chemistry methods. Each molecule is represented by a single conformer generated using the Merck Molecular Force Field (MMFF94s) in the RDKit library. The initial geometry of the lowest-lying conformer was then optimized at the GFN2-xTB level of theory, followed by refinement of the electronic structure with DFT (B3LYP/def2svp). Note that while the 3D structure is used to compute the properties, it is not used in the model, where the molecule is represented as a 2D input (graph). The advantage of this dataset is that it reports several atomic properties: charges, electrophilic and nucleophilic Fukui indexes, and NMR shielding constants. The same set of molecules was used for masked node pretraining. Another pretraining dataset, PCQM4Mv2 (https://ogb.stanford.edu/docs/lsc/pcqm4mv2/), provides a single molecular property per molecule, the HOMO-LUMO gap, also calculated using quantum chemistry methods. It was curated under the PubChemQC project [23]. For benchmarking the pretrained models, we used the absorption, distribution, metabolism, excretion, and toxicity (ADMET) group of the TDC dataset (https://tdcommons.ai/benchmark/admet_group/overview/), consisting of 9 regression and 13 binary classification tasks for modeling biochemical molecular properties.
Atom-Level Quantum Pretraining. The pretraining on atom-level quantum mechanical properties is achieved via a regression task. In the model, each node corresponds to a heavy (non-hydrogen) atom. Accordingly, the obtained node embeddings are used to predict atom-level properties via a linear layer. The model is trained on the dataset from [13] on each of the available atomic properties, as well as on all of them at the same time in a multi-task setting. As a result, this pretraining yields 5 different pretrained models. In each case except for HOMO-LUMO gap the model was trained as a regression task using L1 loss. A batch size of 100 was used with a fixed learning rate of \(10^{-4}\). In the case of HOMO-LUMO pretraining a triangular cyclic scheduling was employed with a minimum value of \(2\times 10^{-5}\) and a maximum value of \(2\times 10^{-4}\). The training was stopped using an early stopping criterion with a patience of 100 epochs. As for the labels, the properties were not scaled, except for a constant scaling factor of \(10^{-2}\) applied to NMR shielding constants, as we observed that it helped convergence.
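As a rough illustration of the node-level regression described above, the following NumPy sketch applies a linear head to node embeddings and averages the L1 error over real atoms only. The array shapes, the padding mask, and the function name are assumptions of this example; only the L1 loss, the linear layer, and the \(10^{-2}\) NMR scaling come from the text.

```python
import numpy as np

NMR_SCALE = 1e-2  # constant label scaling reported in the text for NMR shielding constants

def atom_l1_loss(node_emb, w, b, targets, atom_mask):
    """Mean absolute error of a linear head, averaged over non-padding atoms.

    node_emb:  (batch, max_atoms, dim) node embeddings from the encoder
    w, b:      linear head parameters, shapes (dim, n_props) and (n_props,)
    targets:   (batch, max_atoms, n_props) atom-level labels
    atom_mask: (batch, max_atoms) boolean, True for real heavy atoms (assumed padding scheme)
    """
    preds = node_emb @ w + b
    abs_err = np.abs(preds - targets)
    # Boolean indexing keeps only rows that belong to real atoms.
    return abs_err[atom_mask].mean()
```

With `n_props = 1` this corresponds to a single-property model, and with `n_props = 4` to the multi-task setting; NMR labels would be multiplied by `NMR_SCALE` before being passed as `targets`.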
Molecule-Level Quantum Pretraining. The pretraining on molecular quantum properties is achieved via a simple regression task where the output is obtained by applying a linear layer to the class token embedding at the last layer of the network. The model is trained to predict the HOMO-LUMO gap on the PCQM4Mv2 dataset. We used the same training hyperparameters as the ones indicated in Sect. 2 for atom-level quantum pretraining. This pretraining yields one additional pretrained model to consider for the downstream tasks.
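The triangular cyclic learning-rate schedule used for HOMO-LUMO pretraining can be sketched as below. The bounds \(2\times 10^{-5}\) and \(2\times 10^{-4}\) are taken from the text; the half-cycle length `half_cycle` is not reported and is a purely illustrative assumption.

```python
def triangular_lr(step, min_lr=2e-5, max_lr=2e-4, half_cycle=1000):
    """Learning rate at a given optimizer step for a triangular cyclic schedule.

    The rate rises linearly from min_lr to max_lr over half_cycle steps,
    then falls linearly back to min_lr over the next half_cycle steps.
    half_cycle is a hypothetical value; the source does not state the cycle length.
    """
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle       # ascending half
    else:
        frac = 2 - cycle_pos / half_cycle   # descending half
    return min_lr + (max_lr - min_lr) * frac
```

Under these assumptions, `triangular_lr(0)` returns the minimum rate, `triangular_lr(1000)` the maximum, and `triangular_lr(500)` the midpoint between the two.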
Masking. Masking pretraining is carried out similarly to what is usually done in BERT-based models [9, 11]. This procedure entails randomly masking 15% of the input graph node tokens by replacing them with the mask token, and then training the model to restore the correct node type from the masked embedding as a multi-class classification task. The model is trained on the molecular structures present in the dataset used for atomic QM properties. As a result, we obtain one additional pretrained model to consider for the downstream tasks. The hyperparameters used for this pretraining are the same as the ones indicated in Sect. 2 for atom-level quantum pretraining, while the loss employed is a cross-entropy loss.
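The corruption step of masked node pretraining can be sketched as follows. The 15% fraction and the replacement with a mask token follow the text; the token id `MASK_TOKEN`, the rounding rule, and the choice to mask at least one node are assumptions of this example.

```python
import random

MASK_TOKEN = 0  # hypothetical id reserved for the mask token

def mask_nodes(atom_tokens, mask_fraction=0.15, rng=None):
    """Replace a random subset of node tokens with MASK_TOKEN.

    Returns the corrupted token list, the masked positions, and the original
    tokens at those positions (the targets of the classification task).
    """
    if not atom_tokens:
        return [], [], []
    rng = rng or random.Random()
    # At least one node is masked so that every molecule contributes a target (assumption).
    n_mask = max(1, round(mask_fraction * len(atom_tokens)))
    masked_idx = sorted(rng.sample(range(len(atom_tokens)), n_mask))
    corrupted = list(atom_tokens)
    for i in masked_idx:
        corrupted[i] = MASK_TOKEN
    labels = [atom_tokens[i] for i in masked_idx]
    return corrupted, masked_idx, labels
```

The model would then be trained with a cross-entropy loss to predict `labels` from the embeddings at `masked_idx`.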
Downstream Tasks. Training and testing on downstream tasks are carried out on the ADMET group from the TDC dataset in the same way as for any molecular property modeling task. For splits and evaluation metrics we follow the guidelines of the benchmark group, for which we refer to [16]. The pretrained models are fine-tuned for each downstream task by training without freezing any layer. Additionally, we train a model from scratch, obtaining a total of 8 final models for each of the 5 default train/validation splitting seeds on each task (considering 22 tasks, 5 seeds and 8 models we obtain a total of 880 models). The hyperparameters are the same for every downstream task: the batch size is 32, and the learning rate follows a triangular cyclic schedule with a minimum value of \(2\times 10^{-5}\) and a maximum value of \(2\times 10^{-4}\). The training is stopped with an early stopping criterion with a patience of 200. The loss used for regression tasks is the L1 loss, while classification tasks use a censored regression approach, again with L1 loss, with the right censor set at 0 for negative examples and the left censor set at 1 for positive examples. For regression labels, given the diversity of the tasks, we opted for standard scaling. Finally, the performance on each task's test set is obtained for each pretraining case by taking the mean and standard deviation of the performances of the 5 models coming from the 5 different training/validation splits.
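One possible reading of the censored L1 loss used for classification is sketched below. The interpretation is an assumption: a negative example incurs no penalty as long as the prediction is at or below 0, and a positive example incurs no penalty as long as the prediction is at or above 1. The source states only where the censors are placed, so the exact form in the authors' code may differ.

```python
def censored_l1(pred, label):
    """Censored absolute error for one binary-labelled example (assumed interpretation).

    label == 1: the target is treated as "at least 1"; only predictions below 1 are penalized.
    label == 0: the target is treated as "at most 0"; only predictions above 0 are penalized.
    """
    if label == 1:
        return max(0.0, 1.0 - pred)
    return max(0.0, pred)

def censored_l1_batch(preds, labels):
    """Mean censored L1 loss over a batch of predictions and 0/1 labels."""
    losses = [censored_l1(p, y) for p, y in zip(preds, labels)]
    return sum(losses) / len(losses)
```

Under this reading, a confident correct prediction such as 1.3 for a positive example contributes zero loss, while 0.4 for the same example contributes 0.6.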
Spectral Analysis of Attention Rollout. To better understand the mechanism behind the pretrained models' improvements, we shift our focus to the analysis of attention weights. We aim to understand along which directions the input molecular representation is decomposed when passed through a given model. To do so, we consider the Attention Rollout matrix [1] \(\tilde{A}\) as a proxy for the model's action on the input. While this approximation is a strong one, as we will see it provides a number of non-trivial insights. For the definition of \(\tilde{A}\) we refer to [1]. We start by considering a simple spectral decomposition of \(\tilde{A}\) (from here on we will make use of the bra-ket notation):
\[\tilde{A} = \sum_{i=0}^{N-1} a_i \mathinner{|{a_i}\rangle}\mathinner{\langle{a_i}|} \tag{1}\]
with \(a_i\in \mathbb {C}\) and \(|a_0|\ge |a_1|\ge ...\ge |a_{N-1}|\) and, based on an empirical observation on one of the pretrained Graphormers (see Fig. 1), we analyse the similarity of the eigenvectors \(\mathinner {|{a_i}\rangle }\) with the eigenvectors of the Laplacian matrix L of the input molecular graph decomposed as
\[L = \sum_{i=0}^{N-1} l_i \mathinner{|{l_i}\rangle}\mathinner{\langle{l_i}|} \tag{2}\]
with \(l_0\le l_1\le ...\le l_{N-1}\). In particular, by considering the overlap matrix \(C_{ij} = |\mathinner {\langle {l_i|a_j}\rangle }|\) we study both how many Laplacian modes are used as the models' eigendirections and how relevant they are as a fraction of the non-trivial spectrum of \(\tilde{A}\) (by non-trivial we mean \(i\ne 0\), as by construction \(|\mathinner {\langle {l_0|a_0}\rangle }|=1\) for properties of L and \(\tilde{A}\)). This fraction is quantified by considering \(\eta = \frac{\sum _{i\in \mathcal {U}\setminus 0} |a_i|}{\sum _{i=1}^{i=N-1} |a_i|}\) where \(\mathcal {U} = \{j | \max _j C_{ij}\ge 0.9\; \textrm{for}\; i \in \left( 0,1,2,...,N-1\right) \}\) with 0.9 being an arbitrarily chosen threshold for similarity. Based on these quantities, we define a metric \(\zeta \) (Eq. 3) that factors everything together, where \(\varTheta \) is the Heaviside function. We then evaluate \(\zeta \) averaged over the test set of each downstream task, reporting for each architecture the distribution across tasks for a fixed pretraining condition. We also analyse, for every task, whether the model ranking in performance correlates with the ranking obtained from the average \(\zeta \) over that test set. For this purpose we use Spearman's rank coefficient and treat performances as the higher the better (e.g. we consider MAE with a negative sign but ROC-AUC with a positive sign).
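The quantities entering this analysis can be sketched in NumPy as follows. The rollout follows the usual recipe from [1] (head averaging, an identity term for the residual connection, and a product over layers), and \(\mathcal{U}\) is read here as the set of rollout eigenvectors whose overlap with some Laplacian eigenvector reaches the threshold. Both the function names and this reading of \(\mathcal{U}\) are assumptions. The sketch stops at \(\eta\) and does not attempt \(\zeta\).

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention Rollout from a list of (heads, N, N) attention arrays, first layer first."""
    n = attn_layers[0].shape[-1]
    rollout = np.eye(n)
    for attn in attn_layers:
        a = attn.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)          # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = a @ rollout
    return rollout

def laplacian_overlap(rollout, adj, threshold=0.9):
    """Overlap matrix C_ij = |<l_i|a_j>|, matched rollout modes, and the fraction eta."""
    lap = np.diag(adj.sum(axis=1)) - adj
    _, l_vecs = np.linalg.eigh(lap)            # Laplacian eigenvalues in ascending order
    a_vals, a_vecs = np.linalg.eig(rollout)    # right eigenvectors, possibly complex
    order = np.argsort(-np.abs(a_vals))        # sort by decreasing |a_i|
    a_vals, a_vecs = a_vals[order], a_vecs[:, order]
    a_vecs = a_vecs / np.linalg.norm(a_vecs, axis=0)
    c = np.abs(l_vecs.T @ a_vecs)
    # Rollout modes that coincide with some Laplacian mode, excluding the trivial mode 0.
    matched = np.where(c.max(axis=0) >= threshold)[0]
    matched = matched[matched != 0]
    weights = np.abs(a_vals)
    eta = weights[matched].sum() / weights[1:].sum()
    return c, matched, eta
```

As a sanity check, if a single-layer attention matrix is itself a function of the Laplacian (for example \(I - 0.2L\) on a three-atom chain), every rollout eigenvector coincides with a Laplacian eigenvector and \(\eta = 1\).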
3 Results and Discussion
Pretraining on Atom-Resolved Tasks Gives the Best Overall Performances. Model performances obtained for the downstream tasks are summarized in Table 1. The table reveals, among other things, that the models trained from scratch or pretrained on the HOMO-LUMO gap (a molecular property) are never among the top performers. The superior performance of the models pretrained on atom-level properties is remarkable considering that the HOMO-LUMO gap dataset contains \(\sim 20\) times more molecules than the dataset used for pretraining on atomic QM properties and for masking.
The Right Atomic QM Pretraining Usually Gives the Best Performances. The same table also makes it possible to count which pretraining most frequently gives the best results. Although results can be quite close, if we rank the models by the average value of the metric, as done in the TDC leaderboard, the models with the highest number of top performances are the one pretrained on all the atomic QM properties, with 10 top results, and the one pretrained with masking, with 6 top results. Atom-level QM pretraining as a group is even more dominant over the studied alternatives: in 17 out of 22 downstream tasks, the right choice of atom-level QM pretraining provides the top-performing model.
Atom-Level QM Pretraining Boosts the Spectral Perception of Molecules. We evaluate the metric \(\zeta \) defined in Eq. 3 as described in Sect. 2, obtaining a distribution of 22 values over the downstream tasks for each pretraining. The result is reported in Fig. 2 as a set of violin plots. Firstly, we clearly see that models trained from scratch or pretrained on the HOMO-LUMO gap present values of \(\zeta \) that are close to 0, indicating little to no presence of non-trivial Laplacian eigenmodes in the spectrum of their \(\tilde{A}\) matrix. In contrast, every atom-resolved pretrained model (including masking) presents nonzero values of \(\zeta \) across the downstream tasks, ranging from \(\sim 1\) to \(\sim 5\). Within this last group of models, pretraining on the atom-level QM properties clearly provides the strongest boost in the perception of graph Laplacian eigenmodes. In particular, the model pretrained on all properties in a multi-task fashion and the model pretrained only on NMR data present the highest values of \(\zeta \), followed by the models pretrained on charges and on nucleophilic and electrophilic Fukui functions.
A Better Spectral Perception of Molecules Usually Correlates with Better Performances. As described in Sect. 2, we analyse the Spearman's rank coefficient \({\textbf {r}}_S\) between \(\zeta \) and performance in each task, using the 8 datapoints coming from the different pretraining methods. The results are reported in Fig. 3. For most tasks (20 out of 22) the value of \({\textbf {r}}_S\) is positive, with 13 tasks presenting \({\textbf {r}}_S\ge 0.5\) and 8 tasks presenting \({\textbf {r}}_S\ge 0.75\). These results are a strong indication that models with a better spectral perception of the molecular graph also demonstrate better performances across different tasks.
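The rank correlation used here can be sketched in plain Python as below, including the sign convention that makes every metric "higher is better". The function names are illustrative, and ties are handled with average ranks, which is a common convention rather than something stated in the text.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank coefficient: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def oriented(scores, lower_is_better):
    """Flip the sign of error metrics such as MAE so that higher always means better."""
    return [-s for s in scores] if lower_is_better else list(scores)
```

For a regression task one would call `spearman(zeta_values, oriented(mae_values, lower_is_better=True))` on the 8 per-model averages, and for a classification task `spearman(zeta_values, oriented(auc_values, lower_is_better=False))`.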
4 Conclusions
A Graphormer neural network was pretrained on several tasks to improve its performance in modelling molecular ADMET properties that are relevant to drug discovery, using the TDC dataset containing 22 downstream tasks. It was found that, among the studied methods, pretraining on atom-level QM properties such as atomic charges, NMR shielding constants and Fukui indexes, or using a masking task similar to the one used in the BERT model, significantly improves the performance in comparison to the non-pretrained model. One of the atom-level QM pretraining tasks was found to yield the best results for 17 out of 22 downstream tasks. For comparison, pretraining on a much larger dataset of calculated HOMO-LUMO gaps, a molecular electronic property, brings little or no improvement. Finally, through a spectral analysis of the Attention Rollout matrices, we showed how pretraining on atom-level QM properties improves the model's perception of spectral properties of the input molecular graph. In particular, by defining an appropriate metric, we show that this effect correlates with the model performance on most of the downstream tasks.
References
Abnar, S., Zuidema, W.H.: Quantifying attention flow in transformers (2020). https://arxiv.org/abs/2005.00928
Beck, M.E.: Do Fukui function maxima relate to sites of metabolism? A critical case study. J. Chem. Inf. Model. 45(2), 273–282 (2005). https://doi.org/10.1021/ci049687n, PMID: 15807488
Born, J., et al.: Chemical representation learning for toxicity prediction. Digit. Disc. 2, 674–691 (2023). https://doi.org/10.1039/D2DD00099G
Broccatelli, F., Trager, R., Reutlinger, M., Karypis, G., Li, M.: Benchmarking accuracy and generalizability of four graph neural networks using large in vitro ADME datasets from different chemical spaces. Mol. Inform. 41(8), 2100321 (2022). https://doi.org/10.1002/minf.202100321, https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.202100321
Bule, M., Jalalimanesh, N., Bayrami, Z., Baeeri, M., Abdollahi, M.: The rise of deep learning and transformations in bioactivity prediction power of molecular modeling tools. Chem. Biol. Drug Des. 98(5), 954–967 (2021). https://doi.org/10.1111/cbdd.13750, https://onlinelibrary.wiley.com/doi/abs/10.1111/cbdd.13750
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., Blaschke, T.: The rise of deep learning in drug discovery. Drug Discov. Today 23(6), 1241–1250 (2018). https://doi.org/10.1016/j.drudis.2018.01.039, https://www.sciencedirect.com/science/article/pii/S1359644617303598
Chuang, K.V., Gunsalus, L.M., Keiser, M.J.: Learning molecular representations for medicinal chemistry. J. Med. Chem. 63(16), 8705–8722 (2020). https://doi.org/10.1021/acs.jmedchem.0c00385, PMID: 32366098
Huang, D.Z., Baber, J.C., Bahmanyar, S.S.: The challenges of generalizability in artificial intelligence for ADME/TOX endpoint and activity prediction. Expert Opin. Drug Discov. 16(9), 1045–1056 (2021). https://doi.org/10.1080/17460441.2021.1901685, PMID: 33739897
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
Ektefaie, Y., Shen, A., Bykova, D., Marin, M., Zitnik, M., Farhat, M.: Evaluating generalizability of artificial intelligence models for molecular datasets. bioRxiv (2024). https://doi.org/10.1101/2024.02.25.581982, https://www.biorxiv.org/content/early/2024/02/28/2024.02.25.581982
Fabian, B., et al.: Molecular representation learning with language models and domain-relevant auxiliary tasks. In: Proceedings of the NeurIPS 2020 Workshop on Machine Learning for Molecules (2020)
Glavatskíkh, M., Leguy, J., Hunault, G., Cauchy, T., Da Mota, B.: Dataset’s chemical diversity limits the generalizability of machine learning predictions. J. Cheminform. 11(1), 69 (2019). https://doi.org/10.1186/s13321-019-0391-2
Guan, Y., et al.: Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors. Chem. Sci. 12(6), 2198–2208 (2021). https://doi.org/10.1039/d0sc04823b
Hoja, J., et al.: QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. Sci. Data 8(1), 43 (2021). https://doi.org/10.1038/s41597-021-00812-2
Hu, W., et al.: Strategies for pre-training graph neural networks. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=HJlWWJSFDH
Huang, K., et al.: Artificial intelligence foundation for therapeutic science. Nat. Chem. Biol. 18(10), 1033–1036 (2022). https://doi.org/10.1038/s41589-022-01131-2
Isert, C., Atz, K., Jiménez-Luna, J., Schneider, G.: QMugs, quantum mechanical properties of drug-like molecules. Sci. Data 9(1) (2022). https://doi.org/10.1038/s41597-022-01390-7
Jayatunga, M.K., Xie, W., Ruder, L., Schulze, U., Meier, C.: AI in small-molecule drug discovery: a coming wave. Nat. Rev. Drug Discov. 21, 175–176 (2022)
Kaufman, B., et al.: COATI: multimodal contrastive pretraining for representing and traversing chemical space. J. Chem. Inf. Model. 64(4), 1145–1157 (2024). https://doi.org/10.1021/acs.jcim.3c01753, PMID: 38316665
Li, M.M., Huang, K., Zitnik, M.: Graph representation learning in biomedicine and healthcare. Nat. Biomed. Eng. 6(12), 1353–1369 (2022). https://doi.org/10.1038/s41551-022-00942-x
Medrano Sandonas, L., et al.: Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. Sci. Data 11(1), 742 (2024)
Müller, L., Galkin, M., Morris, C., Rampášek, L.: Attending to graph transformers. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=HhbqHBBrfZ
Nakata, M., Shimazaki, T.: PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J. Chem. Inf. Model. 57(6), 1300–1308 (2017). https://doi.org/10.1021/acs.jcim.7b00083
Nugmanov, R., Dyubankova, N., Gedich, A., Wegner, J.K.: Bidirectional graphormer for reactivity understanding: neural network trained to reaction atom-to-atom mapping task. J. Chem. Inf. Model. 62(14), 3307–3315 (2022). https://doi.org/10.1021/acs.jcim.2c00344, PMID: 35792579
Wang, Y., Xu, C., Li, Z., Barati Farimani, A.: Denoise pretraining on nonequilibrium molecules for accurate and transferable neural potentials. J. Chem. Theory Comput. 19(15), 5077–5087 (2023). https://doi.org/10.1021/acs.jctc.3c00289, PMID: 37390120
Xia, J., et al.: Mole-BERT: rethinking pre-training graph neural networks for molecules. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=jevY-DtiZTR
Xia, J., Zhu, Y., Du, Y., Li, S.Z.: A systematic survey of chemical pre-trained models. In: Elkind, E. (ed.) Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Survey Track, pp. 6787–6795. International Joint Conferences on Artificial Intelligence Organization (2023). https://doi.org/10.24963/ijcai.2023/760
Ying, C., et al.: Do transformers really perform badly for graph representation? In: Advances in Neural Information Processing Systems, vol. 34, pp. 28877–28888. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/paper_files/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf
Acknowledgments
This study was funded under the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 956832, “Advanced Machine learning for Innovative Drug Discovery” (AIDD).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
About this paper
Cite this paper
Fallani, A., Arjona-Medina, J., Chernichenko, K., Nugmanov, R., Wegner, J.K., Tkatchenko, A. (2025). Atom-Level Quantum Pretraining Enhances the Spectral Perception of Molecular Graphs in Graphormer. In: Clevert, DA., Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) AI in Drug Discovery. AIDD 2024. Lecture Notes in Computer Science, vol 14894. Springer, Cham. https://doi.org/10.1007/978-3-031-72381-0_7
DOI: https://doi.org/10.1007/978-3-031-72381-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72380-3
Online ISBN: 978-3-031-72381-0
eBook Packages: Computer Science (R0)