Abstract
We have developed a program, NeuroText, to populate the neuroscience databases in SenseLab (http://senselab.med.yale.edu/senselab) by mining the natural language text of neuroscience articles. NeuroText uses a two-step approach to identify relevant articles. The first step (pre-processing), aimed at 100% sensitivity, identifies abstracts containing database keywords. In the second step, these potentially relevant abstracts are processed for specificity, as dictated by the database architecture and by neuroscience, lexical, and semantic contexts. NeuroText results were presented to the experts for validation using a dynamically generated interface that also allows expert-validated articles to be automatically deposited into the databases. Of the test set of 912 articles, 735 were rejected at the pre-processing step. For the remaining 177 articles, the accuracy of predicting database-relevant articles was 85%. Twenty-two articles were erroneously identified. NeuroText deferred decisions on 29 articles to the expert. A comparison of NeuroText results with the experts’ analyses revealed that the program failed to identify an article’s relevance correctly when the concepts involved did not yet exist in the knowledgebase or when the abstract presented information vaguely. NeuroText uses two “evolution” techniques (supervised and unsupervised) that play an important role in the continual improvement of the retrieval results. Software that uses the NeuroText approach can facilitate the creation of curated, special-interest bibliography databases.
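The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the NeuroText implementation: the keyword set, the context cues, the substring matching, and the names `preprocess` and `classify` are all hypothetical choices made for illustration. The sketch shows only the general shape of the approach: a permissive keyword filter that favors sensitivity, followed by a stricter context check that can also defer a decision to an expert.

```python
from dataclasses import dataclass

# Hypothetical vocabulary and cues, invented for this sketch (not from NeuroText).
KEYWORDS = {"sodium current", "mitral cell", "dopamine"}
SUPPORTING_CUES = {"recorded", "modulation", "inhibition"}
NEGATING_CUES = {"no effect", "not observed"}


@dataclass
class Decision:
    label: str      # "rejected", "relevant", "not_relevant", or "deferred"
    matched: tuple  # database keywords found in the abstract


def preprocess(abstract: str) -> tuple:
    """Step 1: permissive keyword match, favoring sensitivity."""
    text = abstract.lower()
    return tuple(sorted(k for k in KEYWORDS if k in text))


def classify(abstract: str) -> Decision:
    """Step 2: apply simple context cues to abstracts that passed step 1."""
    matched = preprocess(abstract)
    if not matched:
        return Decision("rejected", matched)
    text = abstract.lower()
    support = any(cue in text for cue in SUPPORTING_CUES)
    negate = any(cue in text for cue in NEGATING_CUES)
    if support and not negate:
        return Decision("relevant", matched)
    if negate and not support:
        return Decision("not_relevant", matched)
    # Conflicting or missing cues: leave the decision to an expert.
    return Decision("deferred", matched)
```

For example, `classify("Dopamine modulation was not observed.")` returns a decision labeled `"deferred"`, because a supporting cue and a negating cue both occur. A real system would need far richer lexical and semantic processing than substring matching.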
Cite this article
Crasto, C.J., Marenco, L.N., Migliore, M. et al. Text mining neuroscience journal articles to populate neuroscience databases. Neuroinform 1, 215–237 (2003). https://doi.org/10.1385/NI:1:3:215
DOI: https://doi.org/10.1385/NI:1:3:215