Method Article

Supervised topic modeling for predicting molecular substructure from mass spectrometry

[version 1; peer review: 2 approved with reservations]
PUBLISHED 19 May 2021

Abstract

Small-molecule metabolites are principal actors in myriad phenomena across biochemistry and serve as an important source of biomarkers and drug candidates. Given a sample of unknown composition, identifying the metabolites present is difficult because of the large number of small molecules both known and yet to be discovered. Even for biofluids such as human blood, building reliable ways of identifying biomarkers is challenging. A workhorse method for characterizing individual molecules in such untargeted metabolomics studies is tandem mass spectrometry (MS/MS). MS/MS spectra provide rich information about chemical composition. However, structural characterization from spectra corresponding to unknown molecules remains a bottleneck in metabolomics, and current methods often rely on matching to pre-existing databases in one form or another. Here we develop a preprocessing scheme and a supervised topic modeling approach, based on labeled latent Dirichlet allocation (LLDA), that identifies modular groups of spectrum fragments and neutral losses corresponding to chemical substructures and maps spectrum features to known chemical structures. These substructures can then be predicted in new, unknown spectra. We find that LLDA is an interpretable and reliable method for structure prediction from MS/MS spectra. Specifically, the LLDA approach has the following advantages: (a) molecular topics are interpretable; (b) a practitioner can select any set of chemical structure labels relevant to their problem; (c) LLDA performs well and can exceed the performance of other methods in predicting substructures in novel contexts.

Keywords

Metabolomics, Machine Learning, Mass Spectrometry, Structure Identification, LC-MS, Tandem Mass Spectrometry, MS/MS

Introduction

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is a powerful experimental method for identifying the small-molecule metabolites in a sample of unknown composition. It provides detailed structural information from a given molecule with the only prerequisite knowledge being the parent molecule’s mass-to-charge ratio. This is especially important since a vast portion of naturally occurring small molecules are believed to remain unidentified1. Yet identifying the structure of a molecule from its MS/MS spectrum remains challenging2. Traditionally, such identification is done by hand, which is difficult, time-consuming, and poses significant reproducibility issues. The repetitive nature of this task naturally lends itself well to a computational approach3.

Existing tools generally follow one of two trends. In spectral library matching, a query MS/MS spectrum is compared to reference spectra via a similarity metric such as a cosine score4. For example, the National Institute of Standards and Technology (NIST) Mass Spectral Library, the Human Metabolome Database (HMDB), Metlin, and the Global Natural Products Social Molecular Networking (GNPS) databases all offer spectral similarity search functionality based on such scores5-8. However, spectral databases tend to be sparse compared to the universe of possible molecular structures3. Simulated spectra can supplement databases with new structures, though matching is usually done on entire spectra9-11. Molecular fingerprint approaches, in which vector representations of molecules are predicted directly from MS/MS spectra, are increasingly popular12-16.

This work focuses on the specific fingerprint task of substructure prediction in MS/MS spectra (Figure 1A). Chemical substructures are defined structural subunits that appear across different molecules and are useful for identifying their parent molecules13,18. Focusing on substructures benefits practitioners because substructures have directly interpretable chemical meaning. Further, predicting chemical substructures rather than relying on spectral library matching allows for novel predictions of molecules not seen in these libraries. An existing approach, MS2LDA, uses topic modeling to find regularly co-occurring sets of MS/MS spectrum fragments and neutral losses19. Here we build on this work by developing a supervised topic modeling approach, using labeled latent Dirichlet allocation (LLDA)17, to find co-occurring spectrum features associated with chemical substructures of the user’s choosing (Figure 1B–C). We find that, compared to alternative methods for supervised substructure identification, LLDA performs well and provides interpretable substructure predictions. Using cosine distance k-nearest neighbors (k-NN) for spectral library matching, we find that LLDA’s relative performance improves as the test set becomes more chemically distinct from the training set and as the substructures being predicted appear with different frequencies between the two sets. This type of generalization is important for applications where the molecular identity of samples is unknown and thought to correspond to novel molecules.


Figure 1. Labeled latent Dirichlet allocation (LLDA) for interpretably predicting structure in tandem mass spectrometry (MS/MS).

(a) Substructure prediction in MS/MS spectra. A tandem (MS/MS) mass spectrum of a small molecule (ergonovine). Spectrum fragments and neutral losses provide information relevant to identifying chemical structure. The goal of the supervised topic modeling approach is to assign interpretable substructure scores for a spectrum via LLDA17. (b) Supervised learning for substructure prediction. A training set is composed of MS/MS spectra with known chemical structures and a choice of substructure (topic) labels. Each spectrum is labeled with its parent molecule’s substructures. (c) The LLDA generative model. The model is defined in 17 and labeled here as it relates to MS/MS substructure prediction. The model is trained on a corpus D of MS/MS spectra and a set of molecular substructures K. Given the fragment/neutral loss (feature) composition Wd,n of each spectrum and the substructure-spectrum label matrix Λ, the model will find the substructure-feature probability matrix β, the spectrum-substructure probability matrix θ, and the feature-substructure assignment matrix Z. α and η are Dirichlet parameters for the distributions from which θ and β are respectively assumed to have been drawn. Φ is a Bernoulli parameter for the distribution from which Λ is assumed to have been drawn. Figure (c) has been reproduced and modified with labels with permission from 17.

Methods

Data sets

In order to facilitate direct comparison to an existing tool, we train and test LLDA using the same data and evaluation metrics as those used in the metabolite substructure auto-recommender (MESSAR) developed in 20. We utilize the MESSAR target training corpus of 3,146 positive-mode LC-MS/MS spectra (available here), originally from the Global Natural Products Social Molecular Networking database8. For evaluation, we use the 5,164 MASSBANK21 validation spectra and the 185 annotated test spectra used in 20, both available here. The test dataset contains spectra for 34 drugs and 126 metabolites from MASSBANK and 25 spectra from the Critical Assessment of Small Molecule Identification 2017 contest. The 712 unique substructures provided in the rule database in 20 (available here) are used. All spectra are provided in .mgf format. Copies of the spectra and a cleaned version of the unique substructure set are included in this publication’s Underlying data22.
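
For readers unfamiliar with the format, an .mgf file is a plain-text list of spectra, each with metadata and a peak list. The following minimal sketch, which is not part of the published pipeline, shows how the provided files could be inspected with the pyteomics library; the file name is a placeholder.

from pyteomics import mgf

# Inspect the first spectrum of a provided .mgf file (path is illustrative).
with mgf.read("train.mgf") as reader:
    for spectrum in reader:
        params = spectrum["params"]              # metadata block of the MGF entry
        mz = spectrum["m/z array"]               # fragment m/z values
        intensity = spectrum["intensity array"]  # fragment intensities
        print(params.get("title"), len(mz), "peaks")
        break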

Data preprocessing

To model a mass spectrum using LLDA, it is necessary to represent the spectrum as a bag-of-words “document”23. First, any fragment with a mass-to-charge ratio (m/z) below 30 is discarded to remove structurally uninformative fragments. Each remaining fragment is then assigned the molecular formula with the closest theoretical m/z that satisfies two conditions: (i) the formula’s theoretical m/z is within 0.1 of the peak’s m/z, and (ii) the formula is a subformula of the spectrum’s parent molecular formula (Figure 2A). We assume all MS/MS spectra have a corresponding molecular formula, a reasonable assumption for a spectrum even if its parent chemical structure is unknown24. This formula matching is done using the rcdk library (version 3.5.0). Peaks with no formula match are discarded. Next, all possible neutral losses are computed as pairwise differences between the kept spectrum fragments. A neutral loss consists of an m/z difference between two fragments, fa and fb. A neutral loss is kept when fa has both a greater intensity and a greater m/z than fb; each kept neutral loss is assigned the mean of the two fragment intensities and matched to formulas in the same manner as fragments. Neutral losses with no matching formulas are discarded. The word count Y for a given spectrum feature (fragment or neutral loss) with intensity x is calculated using a generalized logistic function25:


Figure 2. Tandem mass spectrometry (MS/MS) data preprocessing and bag-of-words conversion.

(a) Mapping spectrum features to molecular formula words. To represent a mass spectrum as a document, each spectrum fragment corresponding to a peak is first mapped to a molecular formula satisfying the following criteria: (i) the theoretical mass-to-charge ratio of the formula is within 0.1 m/z of the observed fragment; (ii) the formula is a subformula of the spectrum’s parent formula. Among matching formulas, the one with the closest theoretical m/z is kept. The same process is repeated for neutral losses, and spectrum features with no formula matches are discarded. (b) Mapping feature intensity to word counts. Word counts are computed for each spectrum fragment and neutral loss using the feature’s normalized intensity as in Equation 1. A word count response for Q = 10 and B = 0.1 is illustrated here.

Y(x) = \frac{100}{\left(1 + Q\,e^{-Bx}\right)^{2}}    (1)

where Q determines the threshold value at intensity x = 0 and B determines the growth rate of the word count as the intensity increases. Input intensities of spectrum fragments are normalized such that the maximum raw intensity is 100, and word counts are rounded to the nearest integer. An example intensity response function for intensity values in the range [0, 100] is shown in Figure 2B.
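
To make this mapping concrete, the following is a minimal sketch of the feature-to-word-count conversion, assuming normalized intensities in [0, 100] and the form of Equation 1 as reconstructed above; the function and variable names are ours, and for brevity words are keyed by m/z here rather than by the matched molecular formula used in the actual pipeline.

import itertools
import numpy as np

def word_count(x, Q=10.0, B=0.1):
    # Generalized logistic word count (Equation 1), rounded to the nearest integer.
    return int(round(100.0 / (1.0 + Q * np.exp(-B * x)) ** 2))

# Hypothetical fragments as (m/z, intensity) pairs, intensities normalized so the maximum is 100.
fragments = [(58.07, 100.0), (130.06, 42.0), (180.08, 15.0)]

# Candidate neutral losses: pairwise m/z differences between kept fragments.
# A loss is kept only when the higher-m/z fragment is also the more intense one,
# and it inherits the mean of the two fragment intensities.
losses = []
for frag_a, frag_b in itertools.combinations(fragments, 2):
    hi, lo = max(frag_a, frag_b), min(frag_a, frag_b)   # tuples compare by m/z first
    if hi[1] > lo[1]:
        losses.append((hi[0] - lo[0], (hi[1] + lo[1]) / 2.0))

counts = {f"frag_{mz:.3f}": word_count(i) for mz, i in fragments}
counts.update({f"loss_{mz:.3f}": word_count(i) for mz, i in losses})
print(counts)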

Using the data and code provided, this document-generation process, including formula mapping and bag-of-words conversion, can be run using the command:

python make_documents.py \
--in_mgf <in_mgf_file> \
--out_dir <out_documents_dir> \
--eval_peak_script evaluate_peak.R \
--n_jobs 1 \
--adduct_element H \
--adduct_element 1

We note that this process can be quite time-consuming if run on a single core: preprocessing a single spectrum on one core can take more than two minutes, depending on the spectrum. As such, the preprocessed spectra for the specified data sets have been provided (see Underlying data22). Runtime can be decreased on a multi-core system by increasing the n_jobs input parameter.

Substructure labels for training spectra with known parent chemical structures are assigned using the RDKit library (version 2020.09.5). Training spectra are labeled with the substructures in the user-defined set that are present in the given spectrum’s parent structure (a minimal sketch of this matching follows the command below). Substructure labeling can be run with the provided data and code using the command:

python prep_data.py \
--train_mgf train.mgf \
--test_mgf test.mgf \
--validate_mgf validate.mgf \
--df_substructs df_substructs.tsv \
--embedded_spectra_out <embedded_spectra_fileName_out.npz> \
--df_labels_out <df_labels_fileName_out.tsv> \
--df_ids_out <df_ids_fileName_out.tsv>
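
The labeling step amounts to substructure matching against each parent structure. The following minimal sketch illustrates the idea with RDKit; the SMILES string, SMARTS patterns, and substructure names are illustrative placeholders, not entries from the provided substructure set.

from rdkit import Chem

# Illustrative parent structure and a small substructure vocabulary given as SMARTS.
parent = Chem.MolFromSmiles("CC(CO)NC(=O)c1cncc2ccccc12")
substructs = {
    "amide": "C(=O)N",
    "tetrahydroisoquinoline": "C1Cc2ccccc2CN1",   # SMARTS of substructure 528 discussed later
    "pyridine_ring": "c1ccncc1",
}

# A spectrum is labeled with every substructure present in its parent molecule.
labels = [
    name for name, smarts in substructs.items()
    if parent.HasSubstructMatch(Chem.MolFromSmarts(smarts))
]
print(labels)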

Training LLDA and predicting substructures in a new spectrum

The LLDA model is implemented using the Python Tomotopy library (version 0.10.2)26. The model takes a set of training spectra, test spectra, and substructures as input. The spectra must be in bag-of-words format, and the training spectra must have been labeled with ground truth substructures as described above. The original LLDA model is described in full in 17. We note that every component of LLDA for modeling text documents has an analog useful for modeling a MS/MS spectrum. A document is an MS/MS spectrum, words are spectrum features (fragments and neutral losses), topics are commonly co-occurring spectrum features, and tags are chemical substructures (Figure 1). LLDA was trained for 2,000 iterations, since model perplexity tended to converge at around 2,000 iterations across various data sets generated using the preprocessing scheme. We note that the number of necessary iterations may differ across data sets. In addition to including a required input argument for the number of iterations in the LLDA model, we also include an optional flag to record and plot model perplexity.
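
As a rough illustration of this usage, the following sketch trains a small Tomotopy LLDA model on already-preprocessed bag-of-words documents and labels; the documents, labels, and constructor arguments are placeholders, and the exact Tomotopy defaults and label-to-topic mapping should be checked against the library documentation.

import tomotopy as tp

# Hypothetical preprocessed input: each spectrum is a list of molecular-formula "words"
# (repeated according to word count) together with its substructure labels.
spectra_words = [
    ["C2H4O", "C3H6N", "C2H4O"],
    ["C3H6N", "C5H8O2"],
]
spectra_labels = [
    ["substruct_12", "substruct_528"],
    ["substruct_12"],
]

model = tp.LLDAModel(seed=0)     # labeled LDA: topics correspond to substructure labels
for words, labels in zip(spectra_words, spectra_labels):
    model.add_doc(words, labels=labels)

model.train(2000)                # the paper trains for 2,000 iterations
print("perplexity:", model.perplexity)

# Per-substructure word distributions (the beta matrix in Figure 1c).
for k in range(len(model.topic_label_dict)):
    print(model.topic_label_dict[k], model.get_topic_words(k, top_n=3))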

Substructure predictions in test spectra are calculated using cosine similarity. The cosine similarity between a substructure topic k and a new spectrum (document) d is calculated as:

\mathrm{sim}(k, d) = \frac{v_k \cdot v_d}{\lVert v_k \rVert \, \lVert v_d \rVert}    (2)

Here vk is the word distribution for topic k and vd is the word-count vector of document d over the vocabulary of the training corpus. Both vectors have length equal to the number of unique words present across all training documents. In this manner, every substructure is assigned a score for presence/absence in a test spectrum (a minimal sketch of this scoring follows the command below). After preprocessing, the LLDA model can be trained and tested using the command:

python run_llda.py \
--Q <Q> \
--B <B> \
--out_dir <llda_out_directory> \
--train_mgf train.mgf \
--test_mgf test.mgf \
--documents_dir documents \
--df_substructs df_substructs.tsv \
--df_labels df_labels.tsv \
--num_iterations <number_iterations>
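
The following is a minimal sketch of the scoring in Equation 2, assuming a trained model’s topic-word matrix and a test spectrum’s word-count vector have already been aligned to the training vocabulary; the names and toy values are illustrative.

import numpy as np

def substructure_scores(topic_word, test_counts):
    # Cosine similarity (Equation 2) between each substructure topic's word
    # distribution (rows of topic_word) and a test spectrum's word-count vector.
    norms = np.linalg.norm(topic_word, axis=1) * np.linalg.norm(test_counts)
    return (topic_word @ test_counts) / np.where(norms == 0, 1.0, norms)

# Toy example: a 4-word training vocabulary and 2 substructure topics.
topic_word = np.array([[0.5, 0.4, 0.1, 0.0],
                       [0.0, 0.1, 0.4, 0.5]])
test_counts = np.array([1.0, 1.0, 0.0, 0.0])         # binarized word counts of the test spectrum
print(substructure_scores(topic_word, test_counts))  # higher score = substructure more likely present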

K-nearest neighbors

For comparison to the practice of spectral library matching, we also implement a k-nearest neighbors method for substructure prediction. Spectra are represented as vectors by assigning fragments to m/z bins of width 0.1. Only fragments with mass-to-charge ratio in the range [0, 1000] are kept; the vectors representing the spectra are thus of length 10,000. To predict the substructures associated with a new spectrum, the k nearest neighbors are found by computing the cosine similarity between the new spectrum’s vector and the vectors of the training spectra. The score for each substructure in the new spectrum is calculated as the mean of the substructure presence and absence labels in the k nearest neighbors’ spectra. For example, for a single substructure s and test spectrum d, if 4 of the 5 nearest neighbor spectra contain s, the predicted score for s in d will be 4/5 = 0.8 (a minimal sketch of this scoring follows the command below). After data preprocessing, k-NN can be run using the command:

python run_knn.py \
--df_ids <df_ids_filename.tsv> \
--df_labels <df_labels_filename.tsv> \
--k 10 \
--embedded_spectra <embedded_spectra_fileName.npz> \
--out_dir <knn_out_directory>
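
For reference, the k-NN scoring just described can be summarized in a few lines; this is a sketch with illustrative names, using the 0.1 m/z bin width over [0, 1000] stated above, and train_labels is assumed to be a binary spectrum-by-substructure matrix.

import numpy as np

def bin_spectrum(mz, intensity, bin_width=0.1, max_mz=1000.0):
    # Vectorize a spectrum by summing fragment intensities into fixed m/z bins.
    vec = np.zeros(int(max_mz / bin_width))
    for m, i in zip(mz, intensity):
        if 0 <= m < max_mz:
            vec[int(m / bin_width)] += i
    return vec

def knn_substructure_scores(test_vec, train_vecs, train_labels, k=10):
    # Score each substructure as the mean of its presence/absence labels among
    # the k training spectra most cosine-similar to the test spectrum.
    sims = (train_vecs @ test_vec) / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec) + 1e-12)
    nearest = np.argsort(-sims)[:k]
    return train_labels[nearest].mean(axis=0)   # one score per substructure, in [0, 1]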

Further code documentation including required libraries and optional arguments can be found in the README file included in this publication’s Underlying data22. A Docker image containing all necessary software and library requirements is also available at Docker Hub (see Extended data).

Results and discussion

Preprocessing hyperparameter search

As described in the preprocessing pipeline, the generalized logistic function (GLF) is used to convert normalized feature intensity values into word counts in order to represent spectra as documents. To evaluate the hyperparameter choices in the GLF, LLDA was evaluated using a range of Q and B values corresponding to a wide range of intensity response functions. Performance was measured using the same metric as in 20. Specifically, the metric, denoted t3≥2, counts the number of spectra in which at least two of the top-3 scoring substructures are true positives. This is similar to the standard metric of recall used in recommender systems27. In this experiment, LLDA predicted substructures best for Q and B values corresponding to binarized word counts. In other words, for a given document, words that appear in the given spectrum receive a word count of 1 and all other words receive a word count of 0; this corresponds to Q = 10.0 and B = 0.001 (Figure 3A). This finding, which deserves further research, suggests that in this formulation intensity information may harm model performance. This could be for a number of reasons. One possibility is that the GLF formulated in Equation 1 is not an optimal choice of function to map intensities to word counts. There are many candidate functions for this conversion; the GLF was chosen because of its flexibility to produce a diverse set of intensity-word count relationships. Another possibility is that the neutral loss intensity encoding must be revisited. There may be better methods of representing a neutral loss’s intensity than taking the mean of its two constituent fragments. Additionally, valuable information might be destroyed when raw ion counts are converted to normalized intensities, as is standard practice14,18.
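
For clarity, the evaluation metric can be written compactly as below; this is a sketch with illustrative names, where scores has one row per spectrum and one column per substructure, and truth is the corresponding binary label matrix.

import numpy as np

def t3_at_least(scores, truth, min_correct=2):
    # Count spectra for which at least `min_correct` of the top-3 scoring
    # substructures are true positives (t3>=2 when min_correct=2).
    top3 = np.argsort(-scores, axis=1)[:, :3]               # indices of the top-3 substructures
    hits = np.take_along_axis(truth, top3, axis=1).sum(1)   # true positives among the top 3
    return int((hits >= min_correct).sum())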


Figure 3. Preprocessing-optimized labeled latent Dirichlet allocation (LLDA) performs competitively in substructure identification and generalizes in difficult test sets.

(a) Empirical study of the normalized intensity response function. The performance of the LLDA model using the intensity response shown in Figure 2b with various values of Q and B on the validation set of 5,164 MASSBANK spectra used in 20. Performance is measured by counting the number of spectra in which at least 2 of the top-3 predicted substructures are correct (t3≥2). The binarized response function performs best (Q = 10, B = 0.001), meaning that a generative process that does not include normalized intensity information outperforms other choices in this formulation. (b) LLDA outperforms k-nearest neighbors (k-NN) in generalizing to evaluation sets with less overlap in substructure or chemical similarity. The difference between LLDA performance (measured by the area under the receiver operating characteristic curve, AUC) and k-NN performance is computed. The average chemical similarity between the train and test sets is computed using the RDKit fingerprint. Colors represent the substructure appearance difference between the training and test sets. Increasing the difficulty of generalization along either axis of molecular similarity or substructure overlap increases LLDA performance relative to the k-NN model. (c) LLDA can recognize substructures outside their training context. A test spectrum is shown in which substructure 528 appears. The parent structure (PubChem ID 57509371) is displayed with the portions of the structure containing substructure 528 highlighted. Fragments in red are in the 95th percentile of words in the LLDA topic for substructure 528. The k-NN model performs poorly on such substructures that appear out of context in the test set, while LLDA maintains predictive performance in this case.

Substructure identification benchmark

LLDA, with the best-performing values of Q and B (Q = 10.0, B = 0.001, corresponding to binary word counts), was then run on the benchmark test dataset from 20, consisting of 185 spectra. On these data, the authors of 20 report the following results for MESSAR on the top 3 recommended substructures for each spectrum: 79 spectra in which at least 1 recommendation is correct (t3≥1) and 40 spectra in which at least 2 recommendations are correct (t3≥2). The LLDA model yields 125 cases for t3≥1 and 82 cases for t3≥2.

The k-nearest neighbors method was studied using k = [1, 5, 10] on the same set of evaluation data. For these experiments, we find that the t3≥1 metric is [111, 128, 130] respectively and that t3≥2 is [74, 117, 121]. These results indicate that the k-nearest neighbors method with k=10 outperforms all other methods on this set of train/test data according to these metrics. These results are shown in Table 1. However, it is unclear whether it is possible to extract an analogue to ‘topics’ from the k-nearest neighbors algorithm, as is common in LLDA and topic models. Further research is needed to develop methods that perform as well as k-nearest neighbors while retaining the interpretability and modularity of topic modeling approaches to substructure identification, such as LLDA. As the output of machine learning methods for mass spectrometry data is typically inspected by a human in an experimental procedure, developing interpretable methods remains important. We note that the same train spectra, test spectra, and evaluation metrics as used in 20 were used in this work. This was done to maximize the comparability between methods. Future development of community-wide standards for benchmark datasets and evaluation metrics will greatly facilitate development of new methods.

Table 1. Performance of labeled latent Dirichlet allocation (LLDA) and cosine distance k-nearest neighbors on test data from 20.

Method                         t3≥1    t3≥2
MESSAR20                         79      40
LLDA                            125      82
K-nearest neighbors (k=1)       111      74
K-nearest neighbors (k=5)       128     117
K-nearest neighbors (k=10)      130     121

Ablation study

To study how well k-nearest neighbors performs compared to LLDA on novel test molecules poorly represented in the training set, increasingly difficult subsets of the test data were constructed using two notions of how data might be limited in a real-world experimental setting: (i) chemical similarity between training and test molecules, and (ii) differential train-test appearance of substructures.

For chemical similarity, the RDKit fingerprint similarity was calculated between each test spectrum’s parent structure and all structures in the training set. These similarity scores were then used to set similarity thresholds for selecting increasingly difficult subsets of the test data. For example, a maximum similarity threshold of 0.4 produced a subset of the test spectra in which the maximum pairwise structure fingerprint similarity to the train structures is at most 0.4. For substructure appearance, the number of times a substructure appears in the training set was counted and normalized by the number of spectra, and the same was done for the test set. Substructures were then ordered by these differences, and percentile cutoffs were used to produce subsets of test spectra of increasing difficulty in terms of test-minus-train appearance. For example, a 60th percentile cutoff means testing only substructures whose normalized test-minus-train appearance values are above the 60th percentile of all such values across all substructures. The performance of the LLDA model compared to the k-NN model was calculated as the LLDA area under the receiver operating characteristic curve (AUC), averaged across the given test set, minus the average AUC for the k-NN model.
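
The chemical-similarity filtering can be sketched as follows, assuming parent structures are available as SMILES strings; the use of Tanimoto similarity over the RDKit fingerprint is our assumption of the standard choice, and all names are illustrative.

from rdkit import Chem, DataStructs

def max_train_similarity(test_smiles, train_smiles):
    # Maximum RDKit-fingerprint similarity between one test structure and all
    # training structures; used to build test subsets below a similarity threshold.
    test_fp = Chem.RDKFingerprint(Chem.MolFromSmiles(test_smiles))
    train_fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in train_smiles]
    return max(DataStructs.TanimotoSimilarity(test_fp, fp) for fp in train_fps)

# Hypothetical usage: keep only test spectra whose parent structure has at most
# 0.4 similarity to every training structure.
# hard_subset = [i for i, smi in enumerate(test_smiles_list)
#                if max_train_similarity(smi, train_smiles_list) <= 0.4]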

The results of this study are shown in Figure 3B. Increasing difficulty along either axis of chemical similarity or substructure overlap improved the performance of the LLDA model relative to the k-NN model. This effect was especially pronounced as test-train molecular similarity decreased. These results indicate use cases in which an approach such as LLDA may be especially useful compared to spectral library searching. LLDA may be better able to recognize novel chemical configurations of substructures in new spectra. As such, it may be a better-suited model for characterizing spectra of molecules coming from new and understudied areas of chemical space not well represented in spectral libraries.

Conclusions

Metabolites are central in biology, yet the vast majority are likely unidentified28. Untargeted metabolomics via LC-MS/MS is a promising option for identifying new metabolites in a high-throughput pipeline. Improved computational methods for identifying chemical structure from MS/MS spectra are needed for this promise to become a reality. With this in mind, we developed, described, and open-sourced a supervised topic modeling method for identifying chemical substructures in tandem mass spectrometry data via LLDA17. In a series of empirical studies, this supervised topic model was trained and tested on publicly available benchmark data and substructures, and LLDA was compared to an alternative method, MESSAR20. A k-nearest neighbors (k-NN) model was also implemented as a means of testing spectral library matching, predicting substructure labels from neighbor averages.

We report several benefits of the LLDA method. First, when trained and tested using the same spectra, LLDA provides interpretable, probabilistic substructure topics and performs well using the same metrics as in 20. These topics can incorporate a large number of spectrum fragments and neutral losses, so the patterns of spectrum fragments and neutral losses associated with substructures can be as complex as necessary for good predictive performance. Second, LLDA is a probabilistic model that can compensate for ambiguity, redundancy, and other noise that arises from computing substructure labels; such a probabilistic method is advantageous because substructure labels often have significant overlap12. Finally, the LLDA model offers a flexible supervised framework. A practitioner may choose any set of substructure labels on which to train the LLDA model, allowing the user to tailor the output of the model to a specific application requiring accurate and interpretable substructure identification. LLDA thus offers a supervised topic modeling approach that complements MS2LDA19, circumventing the need to pick an arbitrary number of unsupervised topics or to map output topics to substructures, since this relationship is predetermined by the user.

By systematically exploring the preprocessing pipeline used to map spectrum features to a representation of spectra as documents, we find that this LLDA model performs best when intensity information is hidden from the model and binary word counts are used to represent MS/MS spectra (Figure 3A). This aligns with existing work in probabilistic topic models used in recommender systems, such as collaborative topic Poisson factorization, where binarized ratings lead to improved predictive performance29. However, we also note that this effect may arise from the choice of train, test, and validation sets, which were taken from publicly available spectra with potential heterogeneity in collision energies. We further note the possibility that a different function for translating intensities to word counts, especially for neutral losses, may result in better performance. Neutral losses may be more robust against systematic, uniform measurement error in a spectrum and as such could represent more stable sources of information. However, the manner in which intensities are assigned to neutral losses and which neutral losses are kept will heavily affect model performance.

The k-nearest neighbors (k-NN) model with k=10 performs very well using the same data and evaluation metrics, raising important considerations about trade-offs between predictive performance of substructure identification and interpretability of results. We find that the k-NN model may perform well for the task of substructure identification in situations in which substructures appear in similar patterns between train and test sets. Similarly, k-NN may perform well when a test spectrum corresponds to a molecule that is similar to those in the train set. But as explored in Figure 3B, the k-NN model suffers when these similarities are reduced: when the test spectra correspond to molecules increasingly different from those in the train set, or when substructure appearance varies between train and test sets. This observation is important to consider in the development and evaluation of machine learning methods for substructure identification; assessing generalization performance in real-world settings should reflect cases in which a new spectrum comes from an underexplored area of chemical space that is not well represented by existing spectral libraries. We leave to future work the problem of including such quantities in evaluation metrics to more accurately assess generalization. Ablation studies such as the one presented here can provide the foundation for better metrics and better ways of constructing evaluation sets relevant to a substructure identification task in a specific problem setting.

An example case study: the substructure with identifier 528 in the unfiltered train/test set corresponds to the SMARTS string C1Cc2ccccc2CN1 and appears in one test spectrum. In this test spectrum, substructure 528 appears without many of its co-substructures from the training set, and those co-substructures appear elsewhere in the test set without substructure 528. As a result, the k-NN (k = 10) model produces false positive predictions for substructure 528 in the test set, resulting in an AUC of 0.43. The LLDA model, in contrast, picks up on this substructure’s presence in the test set without suffering from false positives, achieving an AUC of 0.99. The spectrum containing substructure 528 (spectrum 128 in the test set), its structure, and the spectrum fragments in the 95th percentile of substructure 528’s LLDA topic are shown in Figure 3C.

Further work is required to better optimize LLDA. Additional preprocessing steps can be explored, such as keeping spectrum fragments that do not map to a child molecular formula but appear consistently across spectra (rather than discarding them as described in the Methods). The inclusion of such orphaned fragments could improve downstream ranking of substructures. A similar preprocessing step is effective in processing amplicon sequencing data30. A number of limitations remain in LC-MS/MS metabolite identification, including a paucity of training data31, difficulties in selecting a vocabulary of substructures12, and the heterogeneity involved in instrument choice, acquisition method, and sample conditions. Performance drops resulting from these issues are often difficult to disentangle from features of the data that result solely from chemical structure. Future work includes optimization of substructure label sets, incorporation of prior knowledge such as ionization mode or instrument type, and testing these models on a larger dataset. The work presented here highlights the increasing interest in, and relevance of, representing MS/MS spectra in ways that capture more information than a mass-to-charge-ratio binning scheme combined with a prediction metric such as cosine similarity. Another recent example of such an effort can be found in 32.

By releasing all code and preprocessed training data for the methods developed here, we encourage reproducibility efforts alongside the development of machine learning methods for de novo identification of unknown metabolites in LC-MS/MS data, as computational metabolomics stands to benefit from community engagement, consistent public evaluation methods, and open-source models.

List of abbreviations

MS/MS: Tandem mass spectrometry

LC-MS/MS: Liquid chromatography - tandem mass spectrometry

m/z: mass-to-charge ratio

LLDA: Labeled latent Dirichlet allocation17

MESSAR: Metabolite substructure auto-recommender20

k-NN: K-nearest neighbors

AUC: Area under the receiver operating characteristic curve

GLF: Generalized logistic function

Data availability

Underlying data

Zenodo: MS2 LLDA Topic Model.

https://doi.org/10.5281/zenodo.4589653 (reference 22).

This project contains the following underlying data:

  • readme.md (file containing descriptions, specifications, and instructions for included data and using code).

  • requirements.txt (exported conda environment containing the necessary Python dependencies – R dependencies must be installed separately. See readme.md).

  • data.zip (data used in paper’s empirical study, see readme.md).

  • code.zip (code for preparing data, running LLDA, and running kNN. See readme.md).

Data are available under the terms of the Apache License 2.0.

Extended data

Docker image: https://hub.docker.com/r/gkreder/ms2-topic-model

Analysis code available from Github: https://github.com/gkreder/ms2-topic-model

Archived code at time of publication: https://doi.org/10.5281/zenodo.4589653 (reference 22).

License: Apache License 2.0.
