{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,9,9]],"date-time":"2023-09-09T20:55:25Z","timestamp":1694292925806},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"S1","license":[{"start":{"date-parts":[[2015,1,19]],"date-time":"2015-01-19T00:00:00Z","timestamp":1421625600000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2015,12]]},"abstract":"Abstract\n \n Background\n In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is very crucial to be able to identify chemical entity mentions (CEMs) automatically within text. The CHEMDNER challenge in BioCreative IV was specially designed to promote the implementation of corresponding systems that are able to detect mentions of chemical compounds and drugs, which has two subtasks: CDI (Chemical Document Indexing) and CEM.\n \n Results\n Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter in CRF model was optimized by 10-fold cross validation with grid search, and word representations feature induced by Brown clustering method was introduced. 
For the CEM subtask, our official runs were ranked in top position by obtaining maximum 88.79% precision, 69.08% recall and 77.70% balanced F-measure, which were improved further to 88.43% precision, 76.48% recall and 82.02% balanced F-measure in our post-challenge system.\n \n Conclusions\n In our system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. Though our current system has much room for improvement, our system is valuable in showing that the performance in terms of balanced F-measure can be improved largely by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and optimizing the cost parameter in CRF model. From our practice and lessons, if one directly utilizes some open-source natural language processing (NLP) toolkits, such as OpenNLP, Stanford CoreNLP, false positive (FP) rate may be very high. It is better to develop some additional rules to minimize the FP rate if one does not want to re-train the related models. 
Our CEM recognition system is available at: http:\/\/www.SciTeMiner.org\/XuShuo\/Demo\/CEM.","DOI":"10.1186\/1758-2946-7-s1-s11","type":"journal-article","created":{"date-parts":[[2015,6,18]],"date-time":"2015-06-18T13:06:03Z","timestamp":1434632763000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature"],"prefix":"10.1186","volume":"7","author":[{"given":"Shuo","family":"Xu","sequence":"first","affiliation":[]},{"given":"Xin","family":"An","sequence":"additional","affiliation":[]},{"given":"Lijun","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Yunliang","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Haodong","family":"Zhang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2015,1,19]]},"reference":[{"issue":"Suppl 1","key":"630_CR1","doi-asserted-by":"publisher","first-page":"S1","DOI":"10.1186\/1758-2946-7-S1-S1","volume":"7","author":"M Krallinger","year":"2015","unstructured":"Krallinger M, Leitner F, Rabal O, Vazquez M, Miguel J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform. 2015, 7 (Suppl 1): S1-","journal-title":"J Cheminform"},{"issue":"7","key":"630_CR2","doi-asserted-by":"publisher","first-page":"1000450","DOI":"10.1371\/journal.pcbi.1000450","volume":"5","author":"J Li","year":"2009","unstructured":"Li J, Zhu X, Chen JY: Building disease-specific drug-protein connectivity maps from molecular interaction networks and pubmed abstracts. PLoS Computational Biology. 2009, 5 (7): 1000450-10.1371\/journal.pcbi.1000450. 
doi:10.1371\/journal.pcbi.1000450","journal-title":"PLoS Computational Biology"},{"issue":"17","key":"630_CR3","first-page":"1","volume":"6","author":"S Eltyeb","year":"2014","unstructured":"Eltyeb S, Salim N: Chemical named entities recognition: A review on approaches and applications. Journal of Cheminformatics. 2014, 6 (17): 1-12. doi:10.1186\/1758-2946-6-17","journal-title":"Journal of Cheminformatics"},{"issue":"6-7","key":"630_CR4","doi-asserted-by":"publisher","first-page":"506","DOI":"10.1002\/minf.201100005","volume":"30","author":"M Vazquez","year":"2011","unstructured":"Vazquez M, Krallinger M, Leitner F, Valencia A: Text mining for drugs and chemical compound: Methods, tools and applications. Molecular Informatics. 2011, 30 (6-7): 506-519. 10.1002\/minf.201100005. doi:10.1002\/minf.201100005","journal-title":"Molecular Informatics"},{"issue":"Suppl 2","key":"630_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/gb-2008-9-s2-s1","volume":"9","author":"M Krallinger","year":"2008","unstructured":"Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: Overview of the second BioCreative community challenge. Genome Biology. 2008, 9 (Suppl 2): 1-10.1186\/gb-2008-9-s2-s1. doi:10.1186\/gb-2008-9-S2-S1","journal-title":"Genome Biology"},{"key":"630_CR6","first-page":"152","volume-title":"Proceedings of the 4th BioCreative Challenge Evaluation Workshop","author":"S Xu","year":"2013","unstructured":"Xu S, An X, Zhu L, Zhang Y, Zhang H: A CRF-based system for recognizing chemical entities in biomedical literature. Proceedings of the 4th BioCreative Challenge Evaluation Workshop. Edited by: Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A. 2013, 2: 152-157."},
{"key":"630_CR7","first-page":"1360","volume-title":"Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering","author":"S Xu","year":"2007","unstructured":"Xu S, Ma F, Tao L: Learn from the information contained in the false splice sites as well as in the true splice sites using SVM. Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering. 2007, Atlantis Press, Amsterdam, Netherlands, 1360-1366. doi:10.2991\/iske.2007.13"},{"key":"630_CR8","volume-title":"PhD thesis","author":"S Xu","year":"2008","unstructured":"Xu S: Selenoprotein genes prediction in silico based on machine learning approaches. PhD thesis. 2008, China Agricultural University"},{"key":"630_CR9","volume-title":"Proceedings of the International Conference on Learning Representations","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Chen K, Corrado G, Dean J: Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations. 2013"},{"key":"630_CR10","volume-title":"Master's thesis","author":"P Liang","year":"2005","unstructured":"Liang P: Semi-supervised learning for natural language. Master's thesis. 2005, Massachusetts Institute of Technology"},{"key":"630_CR11","first-page":"384","volume-title":"Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA","author":"J Turian","year":"2010","unstructured":"Turian J, Ratinov L, Bengio Y: Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA. 
2010, 384-394."},{"key":"630_CR12","first-page":"282","volume-title":"Proceedings of the 18th International Conference on Machine Learning","author":"J Lafferty","year":"2001","unstructured":"Lafferty J, McCallum A, Pereira F: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning. 2001, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282-289."},{"issue":"Suppl 1","key":"630_CR13","doi-asserted-by":"publisher","first-page":"S2","DOI":"10.1186\/1758-2946-7-S1-S2","volume":"7","author":"M Krallinger","year":"2015","unstructured":"Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, Sayle RA, Batista-Navarro RT, Rak R, Huber T, Rocktaschel T, Matos S, Campos D, Tang B, Xu H, Munkhdalai T, Ryu KH, Ramanan SV, Nathan S, Zitnik S, Bajec M, Weber L, Irmer M, Akhondi SA, Kors JA, Xu S, An X, Sikdar UK, Ekbal A, Yoshioka M, Dieb TM, Choi M, Verspoor K, Khabsa M, Giles CL, Liu H, Ravikumar KE, Lamurias A, Couto FM, Dai H, Tsai RT, Ata C, Can T, Usie A, Alves R, Segura-Bedmar I, Martinez P, Oyarzabal J, Valencia A: The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform. 2015, 7 (Suppl 1): S2-","journal-title":"J Cheminform"},{"key":"630_CR14","first-page":"213","volume-title":"Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA","author":"F Sha","year":"2003","unstructured":"Sha F, Pereira F: Shallow parsing with conditional random fields. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA. 2003, 213-220. 
doi:10.3115\/1073445.1073473"},{"key":"630_CR15","first-page":"337","volume-title":"Proceedings of Conference on Human Language Technology\/North American Chapter of the Association for Computational Linguistics Annual Meeting","author":"S Miller","year":"2004","unstructured":"Miller S, Guinness J, Zamanian A: Name tagging with word clusters and discriminative training. Proceedings of Conference on Human Language Technology\/North American Chapter of the Association for Computational Linguistics Annual Meeting. 2004, Association for Computational Linguistics, Boston, Massachusetts, 337-342."},{"key":"630_CR16","first-page":"119","volume":"23","author":"K Ganchev","year":"2007","unstructured":"Ganchev K, Crammer K, Pereira F, Mann G, Bellare K, McCallum A, Carroll S, Jin Y, White P: Penn\/Umass\/CHOP BioCreative II systems. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop. 2007, 23: 119-124.","journal-title":"Proceedings of the 2nd BioCreative Challenge Evaluation Workshop"},{"issue":"4","key":"630_CR17","first-page":"467","volume":"18","author":"PF Brown","year":"1992","unstructured":"Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC: Class-based n-gram models of natural language. Computational Linguistics. 1992, 18 (4): 467-479.","journal-title":"Computational Linguistics"},{"key":"630_CR18","first-page":"141","volume-title":"Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA","author":"JR Finkel","year":"2009","unstructured":"Finkel JR, Manning CD: Nested named entity recognition. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA. 2009, 141-150."},{"key":"630_CR19","unstructured":"The Apache OpenNLP Library. 
[http:\/\/opennlp.apache.org\/index.html]"},{"key":"630_CR20","first-page":"985","volume-title":"Proceedings of the 24th International Conference on Computational Linguistics","author":"J Read","year":"2012","unstructured":"Read J, Dridan R, Oepen S, Solberg LJ: Sentence boundary detection: A long solved problem?. Proceedings of the 24th International Conference on Computational Linguistics. Edited by: Kay M, Boitet C. 2012, Indian Institute of Technology Bombay, Mumbai, Maharashtra, India, 985-994."},{"issue":"11","key":"630_CR21","doi-asserted-by":"publisher","first-page":"1433","DOI":"10.1093\/bioinformatics\/btt156","volume":"29","author":"C-H Wei","year":"2013","unstructured":"Wei C-H, Harris BR, Kao H-Y, Lu Z: tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013, 29 (11): 1433-1439.","journal-title":"Bioinformatics"},{"issue":"Suppl 1","key":"630_CR22","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/1471-2105-6-S1-S6","volume":"6","author":"R McDonald","year":"2005","unstructured":"McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005, 6 (Suppl 1): 6-10.1186\/1471-2105-6-S1-S6. doi:10.1186\/1471-2105-6-S1-S6","journal-title":"BMC Bioinformatics"},{"key":"630_CR23","first-page":"109","volume":"23","author":"H-S Huang","year":"2007","unstructured":"Huang H-S, Lin Y-S, Lin K-T, Kuo C-J, Chang Y-M, Yang B-H, Chung I-F, Hsu C-N: High-recall gene mention recognition by unification of multiple background parsing models. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop. 
2007, 23: 109-111.","journal-title":"Proceedings of the 2nd BioCreative Challenge Evaluation Workshop"},{"key":"630_CR24","first-page":"89","volume-title":"Proceedings of the 2nd BioCreative Challenge Evaluation Workshop","author":"R Klinger","year":"2007","unstructured":"Klinger R, Friedrich CM, Fluck J, Hofmann-Apitius M: Named entity recognition with combinations of conditional random fields. Proceedings of the 2nd BioCreative Challenge Evaluation Workshop. Edited by: Hirschman L, Krallinger M, Valencia A. 2007, 89-92."},{"issue":"3","key":"630_CR25","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1007\/BF01589116","volume":"45","author":"DC Liu","year":"1989","unstructured":"Liu DC, Nocedal J: On the limited memory BFGS method for large scale optimization. Mathematical Programming. 1989, 45 (3): 503-528. doi:10.1007\/BF01589116","journal-title":"Mathematical Programming"},{"key":"630_CR26","unstructured":"Kudo T: CRF++: Yet Another CRF Toolkit. [http:\/\/crfpp.googlecode.com\/svn\/trunk\/doc\/index.html]"},{"issue":"3","key":"630_CR27","doi-asserted-by":"publisher","first-page":"130","DOI":"10.1108\/eb046814","volume":"14","author":"MF Porter","year":"1980","unstructured":"Porter MF: An algorithm for suffix stripping. Program. 1980, 14 (3): 130-137. 10.1108\/eb046814.","journal-title":"Program"},{"key":"630_CR28","unstructured":"Manning C, Bauer J: Stanford CoreNLP - A Suite of NLP Tools. [http:\/\/nlp.stanford.edu\/software\/corenlp.shtml]"},{"key":"630_CR29","volume-title":"Proceedings of the 25th International Conference on Machine Learning","author":"R Collobert","year":"2008","unstructured":"Collobert R, Weston J: A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning. 
2008"},{"key":"630_CR30","first-page":"1081","volume-title":"Advances in Neural Information Processing Systems 21","author":"A Mnih","year":"2009","unstructured":"Mnih A, Hinton GE: A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems 21. Edited by: Koller D, Schuurmans D, Bengio Y, Bottou L. 2009, MIT Press, Cambridge, MA, 1081-1088."},{"key":"630_CR31","unstructured":"Liang P: C++ Implementation of the Brown Word Clustering Algorithm. [https:\/\/github.com\/percyliang\/brown-cluster]"},{"issue":"9","key":"630_CR32","doi-asserted-by":"publisher","first-page":"1098","DOI":"10.1109\/JRPROC.1952.273898","volume":"40","author":"DA Huffman","year":"1952","unstructured":"Huffman DA: A method for the construction of minimum-redundancy codes. Proceedings of the I.R.E. 1952, 40 (9): 1098-1101. doi:10.1109\/JRPROC.1952.273898","journal-title":"Proceedings of the I.R.E"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1758-2946-7-S1-S11.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1758-2946-7-S1-S11\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1758-2946-7-S1-S11.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,2]],"date-time":"2021-09-02T19:12:27Z","timestamp":1630609947000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/1758-2946-7-S1-S11"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,1,19]]},"references-count":32,"journal-issue":{"issue":"S1","published-print":{"date-parts":[[2015,12]]}},"alternative-id":["630"],"URL":"https:\/\/doi.org\
/10.1186\/1758-2946-7-s1-s11","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,1,19]]},"assertion":[{"value":"19 January 2015","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S11"}}