1 Introduction

In recent years, the application of deep learning techniques has brought about a paradigm shift in molecular representation learning, playing a pivotal role in a wide array of biochemical endeavors including property modeling and drug design [3, 5,6,7, 18, 20]. Leveraging deep learning methodologies has enabled researchers to extract intricate features from molecular data, thereby enhancing our understanding of molecular structures and their interactions. However, despite the remarkable successes achieved, challenges such as data scarcity and limited generalizability remain pertinent concerns in the field [3, 4, 8, 10, 12]. To address these challenges, the concept of pretraining models on related tasks or employing self-supervised learning strategies has gained significant traction. Pretraining serves as a means to provide models with a foundational understanding of molecular structures, enabling them to learn meaningful representations even in the presence of limited or noisy labeled data. By leveraging pretraining techniques, researchers aim to enhance model generalizability and performance across a spectrum of downstream tasks [15, 19, 25,26,27].

In this context, our study investigates the impact of pretraining on atom-level quantum mechanical (QM) properties, which capture fundamental aspects of molecular behavior with profound implications for biochemical research [2] and are present in an increasing number of public datasets [14, 17, 21]. The pretraining is implemented on a Graphormer neural network [28], an instance of the increasingly popular family of Graph Transformer (GT) architectures [22]. Specifically, we compare the efficacy of such pretraining with alternative strategies: pretraining on a molecular quantum property (the HOMO-LUMO gap) and masking, an atom-level self-supervised pretraining method. As downstream tasks relevant to applications in the pharmaceutical industry, we use the ADMET properties dataset from the Therapeutics Data Commons (TDC) [16]. For each pretraining technique and downstream property, we complement the comparison of model performance with a spectral analysis of the Attention Rollout matrix, to understand in approximation the factors contributing to the model's behavior. This analysis reveals that models pretrained with atom-level quantum properties and with masking extract graph spectral properties in the form of Laplacian eigenvectors. Moreover, we observe that models pretrained with atom-level quantum properties extract more low-frequency Laplacian eigenmodes from the input graph signal, and we demonstrate how this effect correlates with improved performance on a large share of the downstream tasks.

Fig. 1. Example of the similarity observed between the eigenvectors \(\mathinner {|{a_i}\rangle }\) of the rollout matrix \(\tilde{A}\) and the eigenvectors \(\mathinner {|{l_i}\rangle }\) of the graph Laplacian L for a Graphormer pretrained on atomic QM properties. The molecule used here comes from the TDC dataset.

2 Methods

We consider a custom implementation of Graphormer [24, 28] as an instance of the GT family of networks. As a baseline we employ the non-pretrained Graphormer, which is compared with seven pretrained models for a total of 8 different cases: one for each of the 4 atom-resolved QM properties (atomic charges, NMR shielding constants, electrophilic and nucleophilic Fukui function indexes), one considering all atomic properties in a multi-task setting, one for the considered molecular property (the HOMO-LUMO gap), and one for masked node pretraining. A spectral analysis of the Attention Rollout matrix \(\tilde{A}\) is then performed to gain insight into the behaviour of each obtained model.

Model Description. Graphormer is a GT in which the input molecule is treated as a graph whose nodes are atoms and whose edges are bonds. The model encodes the atoms of the molecule as tokens based on their atom type and then repeatedly applies self-attention layers with an internal bias term added before the softmax. This term is derived from the topological distance matrix of the molecular graph and encodes the structural information of the graph. The network employed in this work is the Graphormer implementation from [24], inspired by the implementation from [28]. In this implementation the centrality encoder is extended from counting only explicit neighbours to counting both explicit atoms and implicit hydrogens. As a result of combining this modified centrality encoding with the usual atom-type encoder, the hybridization of atoms is handled implicitly; for this reason the implementation does not include an edge encoder component.
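
To make the role of this bias concrete, the following minimal PyTorch sketch (our own illustration, not the implementation from [24] or [28]; the class name `BiasedSelfAttention` and the precomputed shortest-path matrix `spd` are assumptions) shows a single-head attention layer in which a learnable per-distance term is added to the logits before the softmax.

```python
import torch
import torch.nn as nn


class BiasedSelfAttention(nn.Module):
    """Single-head self-attention with a Graphormer-style additive bias.

    A learnable scalar per topological distance is added to the attention
    logits before the softmax, so structurally close atoms attend to each
    other more easily."""

    def __init__(self, dim: int, max_dist: int = 20):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.dist_bias = nn.Embedding(max_dist + 1, 1)  # one bias per distance bucket
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, spd: torch.Tensor) -> torch.Tensor:
        # x:   (n_atoms, dim)  node embeddings (atom-type + centrality encodings)
        # spd: (n_atoms, n_atoms) integer shortest-path distances of the graph
        logits = self.q(x) @ self.k(x).T * self.scale      # (n, n)
        logits = logits + self.dist_bias(spd).squeeze(-1)  # structural bias before softmax
        attn = torch.softmax(logits, dim=-1)
        return attn @ self.v(x)


# toy usage: 5 atoms, 16-dimensional embeddings, random distance matrix
x = torch.randn(5, 16)
spd = torch.randint(0, 4, (5, 5))
out = BiasedSelfAttention(16)(x, spd)
print(out.shape)  # torch.Size([5, 16])
```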

Datasets. For pretraining we used a publicly available dataset [13] of 136k organic molecules that contains, among other things, atomic properties calculated with quantum chemistry methods. Each molecule is represented by a single conformer generated using the Merck Molecular Force Field (MMFF94s) in the RDKit library. The initial geometry of the lowest-lying conformer was then optimized at the GFN2-xTB level of theory, followed by refinement of the electronic structure with DFT (B3LYP/def2svp). Note that while the 3D structure is used for the computation of the properties, it is not used in the model, where the molecule is represented as a 2D input (graph). The advantage of this dataset is the number of reported atomic properties: charge, electrophilic and nucleophilic Fukui indexes, and an NMR shielding constant. The same set of molecules was used for masked node pretraining. Another pretraining dataset, PCQM4Mv2, provides a single molecular property per molecule, the HOMO-LUMO gap, also calculated with quantum chemistry methods (https://ogb.stanford.edu/docs/lsc/pcqm4mv2/); it was curated under the PubChemQC project [23]. For benchmarking the pretrained models, we used the absorption, distribution, metabolism, excretion, and toxicity (ADMET) group of the TDC dataset, consisting of 9 regression and 13 binary classification tasks for modeling biochemical molecular properties (https://tdcommons.ai/benchmark/admet_group/overview/).
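
For illustration, the sketch below shows how a molecule could be turned into the 2D graph input described above using RDKit; the helper name `mol_to_graph`, the example SMILES string, and the choice of extracted features are ours and are not taken from the dataset's actual pipeline.

```python
import numpy as np
from rdkit import Chem


def mol_to_graph(smiles: str):
    """Turn a SMILES string into 2D graph inputs: atom-type tokens,
    explicit-plus-implicit degree counts, and the topological
    (shortest-path) distance matrix."""
    mol = Chem.MolFromSmiles(smiles)
    atom_types = [a.GetAtomicNum() for a in mol.GetAtoms()]           # heavy atoms only
    degrees = [a.GetDegree() + a.GetTotalNumHs() for a in mol.GetAtoms()]
    spd = Chem.GetDistanceMatrix(mol).astype(int)                     # topological distances
    return np.array(atom_types), np.array(degrees), spd


atom_types, degrees, spd = mol_to_graph("CCO")  # ethanol, used only as a toy example
print(atom_types, degrees)
print(spd)
```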

Atom-Level Quantum Pretraining. The pretraining on atom-level quantum mechanical properties is cast as a regression task. In the model, each node corresponds to a heavy (non-hydrogen) atom, and the resulting node embeddings are used to predict atom-level properties via a linear layer. The model is trained on the dataset from [13] on each of the available atomic properties individually, as well as on all of them simultaneously in a multi-task setting, yielding 5 different pretrained models. In each case except for the HOMO-LUMO gap the model was trained as a regression task using the L1 loss, with a batch size of 100 and a fixed learning rate of \(10^{-4}\). For the HOMO-LUMO pretraining a triangular cyclic learning-rate schedule was employed with a minimum value of \(2\times 10^{-5}\) and a maximum value of \(2\times 10^{-4}\). Training was stopped using an early stopping criterion with a patience of 100 epochs. The property labels were not scaled, except for a constant scaling factor of \(10^{-2}\) applied to the NMR shielding constants, which we observed to help convergence.
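
A minimal sketch of this pretraining objective, assuming a generic encoder output (the tensor shapes and variable names are illustrative, not the actual code): a linear head maps each node embedding to the target atomic properties and is trained with the L1 loss.

```python
import torch
import torch.nn as nn

n_atoms, dim, n_props = 12, 128, 4  # n_props = 4 in the multi-task setting

# hypothetical encoder output: one embedding per heavy atom
node_embeddings = torch.randn(n_atoms, dim)

head = nn.Linear(dim, n_props)      # atom-level regression head
pred = head(node_embeddings)        # (n_atoms, n_props)

# QM labels per atom (charge, scaled NMR shielding, two Fukui indexes)
target = torch.randn(n_atoms, n_props)
loss = nn.L1Loss()(pred, target)    # L1 loss, as used for the QM pretraining
loss.backward()
```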

Molecule-Level Quantum Pretraining. The pretraining on a molecular quantum property is a simple regression task in which the output is obtained by applying a linear layer to the class-token embedding at the last layer of the network. The model is trained to predict the HOMO-LUMO gap on the PCQM4Mv2 dataset, using the same training hyperparameters as those reported above for the atom-level pretraining. This pretraining yields an additional pretrained model to consider for the downstream tasks.

Masking. Masking pretraining is carried out similarly to what is usually done in BERT-based models [9, 11]: 15% of the input graph node tokens are randomly masked by replacing them with the mask token, and the model is trained to restore the correct node type from the masked embedding as a multi-class classification task. The model is trained on the molecular structures of the dataset used for the atomic QM properties, which yields one additional pretrained model to consider for the downstream tasks. The hyperparameters used for this pretraining are the same as those reported above, while the loss employed is a cross-entropy loss.
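
A minimal sketch of the masking objective under stated assumptions: the vocabulary size, mask-token id and placeholder embedding encoder below are hypothetical stand-ins for the real Graphormer encoder and tokenizer.

```python
import torch
import torch.nn as nn

vocab_size, mask_token = 120, 119        # hypothetical atom-type vocabulary and mask id
tokens = torch.randint(0, 118, (32,))    # atom-type tokens of one molecule (32 heavy atoms)

# randomly select 15% of the nodes and replace them with the mask token
mask = torch.rand(tokens.shape) < 0.15
corrupted = tokens.clone()
corrupted[mask] = mask_token

# a real model would encode `corrupted` with Graphormer; a toy embedding stands in here
dim = 64
encoder = nn.Embedding(vocab_size, dim)   # placeholder for the full encoder
classifier = nn.Linear(dim, vocab_size)   # predicts the original atom type
logits = classifier(encoder(corrupted))

# cross-entropy only on the masked positions (no guard for the unlikely empty mask)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```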

Downstream Tasks. Training and testing on the downstream tasks is carried out on the ADMET group of the TDC dataset in the same way as any molecular property modeling. For splits and evaluation metrics we follow the guidelines of the benchmark group we consider, hence we refer to [16]. The pretrained models are fine-tuned for each downstream task without freezing any layer. Additionally, we also train a model from scratch, obtaining a total of 8 final models for each of the 5 default train/validation split seeds on each task (considering 22 tasks, 5 seeds and 8 models, a total of 880 models). The hyperparameters are the same for every downstream task: a batch size of 32 and a triangular cyclic learning-rate schedule with a minimum value of \(2\times 10^{-5}\) and a maximum value of \(2\times 10^{-4}\). Training is stopped with an early stopping criterion with a patience of 200. The loss used for regression tasks is the L1 loss, while for classification tasks a censored regression approach is used, again with the L1 loss and with the right censor set at 0 for negative examples and the left censor set at 1 for positive examples. Given the diversity of the regression tasks, the regression labels are standard-scaled. Finally, the performance on each task's test set is reported per pretraining case as the mean and standard deviation of the performances of the 5 models coming from the 5 different training/validation splits.
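
The censored regression loss for the classification tasks can be sketched as follows; this reflects one plausible reading of the censoring described above (negatives penalized only when the prediction exceeds 0, positives only when it falls below 1) and is not necessarily the exact implementation.

```python
import torch


def censored_l1_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Censored L1 loss for binary labels: negatives (label 0) incur loss
    only when the prediction exceeds 0, positives (label 1) only when the
    prediction falls below 1."""
    neg_loss = torch.clamp(pred - 0.0, min=0.0)   # penalty above the censor at 0
    pos_loss = torch.clamp(1.0 - pred, min=0.0)   # penalty below the censor at 1
    per_sample = torch.where(label > 0.5, pos_loss, neg_loss)
    return per_sample.mean()


pred = torch.tensor([-0.3, 0.4, 0.8, 1.2])
label = torch.tensor([0.0, 0.0, 1.0, 1.0])
print(censored_l1_loss(pred, label))  # only the middle two samples contribute
```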

Spectral Analysis of Attention Rollout. To better understand the mechanism behind the pretrained models' improvements, we shift our focus to the analysis of attention weights. We aim to understand along which directions the input molecular representation is decomposed when passed through a given model. To do so, we consider the Attention Rollout matrix [1] \(\tilde{A}\) as a proxy for the model's action on the input. While this is a strong approximation, as we will see it provides a number of non-trivial insights. For the definition of \(\tilde{A}\) we refer to [1]. We start from a simple spectral decomposition of \(\tilde{A}\) (from here on we use the bra-ket notation):

$$\begin{aligned} \tilde{A} = \sum _{i=0}^{N-1} a_i\mathinner {|{a_i}\rangle }\mathinner {\langle {a_i}|} \end{aligned}$$
(1)

with \(a_i\in \mathbb {C}\) and \(|a_0|\ge |a_1|\ge ...\ge |a_{N-1}|\). Based on an empirical observation on one of the pretrained Graphormers (see Fig. 1), we analyse the similarity of the eigenvectors \(\mathinner {|{a_i}\rangle }\) with the eigenvectors of the Laplacian matrix L of the input molecular graph, decomposed as

$$\begin{aligned} L = \sum _{i=0}^{N-1} l_i \mathinner {|{l_i}\rangle }\mathinner {\langle {l_i}|} \end{aligned}$$
(2)

with \(l_0\le l_1\le ...\le l_{N-1}\). In particular, by considering the overlap matrix \(C_{ij} = |\mathinner {\langle {l_i|a_j}\rangle }|\) we study both how many Laplacian modes appear among the model's eigendirections and how relevant they are as a fraction of the non-trivial spectrum of \(\tilde{A}\) (by non-trivial we mean \(i\ne 0\), since by construction \(|\mathinner {\langle {l_0|a_0}\rangle }|=1\) owing to the properties of L and \(\tilde{A}\)). This fraction is quantified by \(\eta = \frac{\sum _{i\in \mathcal {U}\setminus \{0\}} |a_i|}{\sum _{i=1}^{N-1} |a_i|}\), where \(\mathcal {U} = \{\,j \mid \max _{i\in \{0,1,\dots ,N-1\}} C_{ij}\ge 0.9\,\}\) and 0.9 is an arbitrarily chosen similarity threshold. Based on these quantities, we define a metric that factors everything together as:

$$\begin{aligned} \zeta = \eta \sum _{i=1}^{N-1}\varTheta \left( \max _{j}C_{i,j}-0.9\right) \end{aligned}$$
(3)

where \(\varTheta \) is the Heaviside function. We then evaluate \(\zeta \) averaged over the test set of each downstream task, reporting for each architecture the distribution across tasks at fixed pretraining condition, and we also analyse, for every task, whether the model ranking in performance correlates with the ranking obtained from the average \(\zeta \) over that test set. For this purpose we use the Spearman's rank coefficient and treat performances as higher-is-better (e.g. we consider MAE with a negative sign but ROC-AUC with a positive sign).
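
The quantities above can be computed as in the following sketch, which assumes the standard head-averaged Attention Rollout of [1] with residual correction and uses toy inputs; it is meant to illustrate Eqs. (1)-(3), not to reproduce the exact analysis code.

```python
import numpy as np


def attention_rollout(att_per_layer):
    """Attention Rollout [1]: average heads, add the residual connection,
    row-normalise, and multiply across layers."""
    n = att_per_layer[0].shape[-1]
    rollout = np.eye(n)
    for att in att_per_layer:                  # att: (heads, n, n)
        a = att.mean(axis=0) + np.eye(n)       # residual connection
        a /= a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    return rollout


def zeta(rollout, adjacency, thr=0.9):
    """Compute the metric zeta of Eq. (3) for one molecular graph."""
    # eigendecompositions: rollout sorted by |eigenvalue| (descending),
    # Laplacian sorted by eigenvalue (ascending)
    a_vals, a_vecs = np.linalg.eig(rollout)
    order = np.argsort(-np.abs(a_vals))
    a_vals, a_vecs = a_vals[order], a_vecs[:, order]

    lap = np.diag(adjacency.sum(axis=1)) - adjacency
    l_vals, l_vecs = np.linalg.eigh(lap)

    # overlap matrix C_ij = |<l_i|a_j>|
    C = np.abs(l_vecs.conj().T @ a_vecs)

    # eta: weight of the matched, non-trivial rollout eigendirections (set U)
    in_U = np.max(C, axis=0) >= thr
    in_U[0] = False                             # drop the trivial i = 0 component
    weights = np.abs(a_vals)
    eta = weights[in_U].sum() / weights[1:].sum()

    # count of non-trivial Laplacian modes recovered by some eigendirection (Eq. 3)
    count = (np.max(C[1:], axis=1) >= thr).sum()
    return eta * count


# toy usage: 6-atom path graph and random attention maps for a 4-layer, 8-head model
rng = np.random.default_rng(0)
adj = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
atts = [rng.random((8, 6, 6)) for _ in range(4)]
print(zeta(attention_rollout(atts), adj))
```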

3 Results and Discussion

Table 1. Global results obtained on the ADMET group of TDC. Each row corresponds to a specific task, along with the metric used for evaluation. Columns represent the different pretrainings considered. Highlighted values denote the best performance achieved among our models, based on the average value as per the ranking criterion from the TDC leaderboard. Additionally, cases where our results surpass the top-performing model in the TDC leaderboard are marked with an asterisk (*).

Pretraining on Atom-Resolved Tasks Gives the Best Overall Performances. Model performances obtained for the downstream tasks are summarized in Table 1. The table reveals, among other things, that the models trained from scratch or pretrained on the HOMO-LUMO gap (a molecular property) are never among the top performers. The superior performance of the models pretrained on atom-level properties is remarkable considering that the HOMO-LUMO gap dataset contains \(\sim 20\) times more molecules than the dataset used for pretraining on atomic QM properties and for masking.

The Right Atomic QM Pretraining Usually Gives the Best Performances. From the same table it is also possible to count which pretraining most frequently gives the best results. Although the results can be quite close, if we rank the models by the average value of the metric, as done in the TDC leaderboard, the pretraining with the highest number of top performances is the one using all the atomic QM properties (10 top results), followed by masking (6 top results). Atom-level QM pretraining as a group shows an even larger advantage over the studied alternatives: in 17 out of 22 downstream tasks the correct choice of atom-level QM pretraining provides the top-performing model.

Fig. 2. Violin plots of the average value of \(\zeta \) on each task in the ADMET group of TDC (\(\langle \zeta _{task}\rangle \)) for every Graphormer model considered.

Fig. 3. Spearman's rank coefficient \({\textbf {r}}_S\) between the average \(\zeta \) value per each model and the corresponding performance per each task in the ADMET group of TDC.

Atom-Level QM Pretraining Boosts the Spectral Perception of Molecules. We evaluate the metric \(\zeta \) defined in Eq. 3 as described in Sect. 2, obtaining a distribution of 22 values over the downstream tasks for each pretraining. The result is reported in Fig. 2 as a set of violin plots. Firstly, we clearly see that models trained from scratch or pretrained on the HOMO-LUMO gap present values of \(\zeta \) close to 0, indicating little to no presence of non-trivial Laplacian eigenmodes in the spectrum of their \(\tilde{A}\) matrix. On the contrary, every atom-resolved pretrained model (including masking) presents nonzero values of \(\zeta \) across the downstream tasks, ranging from \(\sim 1\) to \(\sim 5\). Within this last group of models we clearly notice how pretraining on the atom-level QM properties provides the strongest boost in the perception of graph Laplacian eigenmodes. In particular, the models pretrained using all properties in a multi-task fashion and using only NMR data present the highest values of \(\zeta \), followed by the models pretrained on charges, nucleophilic and electrophilic Fukui functions.

A Better Spectral Perception of Molecules Usually Correlates with Better Performances. As described in Sect. 2, we analyse the Spearman's rank coefficient \({\textbf {r}}_S\) between \(\zeta \) and performance in each task, using the 8 data points coming from the different pretraining methods. The results are reported in Fig. 3. For most tasks (20 out of 22) the value of \({\textbf {r}}_S\) is positive, with 13 tasks presenting \({\textbf {r}}_S\ge 0.5\) and 8 tasks presenting \({\textbf {r}}_S\ge 0.75\). These results are a strong indication that models with a better spectral perception of the molecular graph also demonstrate better performances across different tasks.

4 Conclusions

A Graphormer neural network was pretrained on several tasks to improve its performance in modelling molecular ADMET properties relevant to drug discovery, using the TDC dataset containing 22 downstream tasks. We found that, out of the studied methods, pretraining on atom-level QM properties such as atomic charges, NMR shielding constants and Fukui indexes, or using a masking task similar to the one used in the BERT model, significantly improves the performance in comparison to the non-pretrained model. One of the atom-level QM property pretraining tasks was found to yield the best results for 17 out of 22 downstream tasks. For comparison, pretraining on a much larger dataset of calculated HOMO-LUMO gaps, a molecular electronic property, brings little or no improvement. Finally, through a spectral analysis of the Attention Rollout matrices, we showed that pretraining on atom-level QM properties improves the model's perception of spectral properties of the input molecular graph. In particular, by defining an appropriate metric, we show that this effect correlates with model performance on most of the downstream tasks.