Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)

Volume 8 - Number 2 - October 2015 - Pages 61-76

Extraction of Chemical and Drug Named Entities by Ensemble Learning Using Chemical NER Tools Based on Different Extraction Guidelines

Thaer M. Dieb and Masaharu Yoshioka

Graduate School of Information Science and Technology, Hokkaido University, Japan


Chemical named-entity recognition (chemical NER) is the task of extracting chemical information and chemical-related entities, such as drug names and source materials, from text in domains such as bioinformatics and nanoinformatics. There have been several attempts to construct corpora for handling such chemical-related information, each based on its own corpus-construction guidelines. Although these guidelines cover common types of chemical information, they differ in several ways. As a result, a chemical NER tool developed for a particular guideline may extract the common chemical named entities but have problems extracting other chemical-related entities. Assuming the differences between these guidelines are consistent, the pattern of success and failure of each chemical NER tool should also be consistent. In this paper, we present an ensemble-learning approach that uses a conditional random field (CRF) as the machine-learning technique to fuse the outputs of chemical NER tools with different characteristics, each based on a different guideline, in order to construct a chemical NER tool for a particular guideline. To achieve consistent tokenization across these tools, we applied a post-tokenization mechanism. We evaluated the system using the BioCreative IV CHEMDNER task datasets. We confirmed that the ensemble-learning approach using a combination of chemical NER tools performs better than a simple domain-adaptation approach using just one chemical NER tool. We also confirmed that the ensemble-learning approach can improve the performance of a well-tuned rule-based chemical NER tool on certain tasks.
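
The fusion idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenization rule, the tool names (`tool_a`, `tool_b`), the BIO encoding, and the feature names are all assumptions made for illustration. The sketch re-tokenizes the text with one shared fine-grained rule (standing in for the post-tokenization step), projects each tool's character-offset entity spans onto those tokens, and builds one feature dictionary per token that a CRF could use as input.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # character offsets, [start, end)


def fine_tokenize(text: str) -> List[Span]:
    """Assumed post-tokenization rule: runs of letters, runs of digits,
    or single non-space symbols. The paper's actual rule may differ."""
    return [m.span() for m in re.finditer(r"[A-Za-z]+|\d+|\S", text)]


def bio_tags(tokens: List[Span], entities: List[Span]) -> List[str]:
    """Project one tool's entity spans onto the shared tokens as B/I/O tags.
    A token is tagged only if it lies fully inside an entity span."""
    tags = []
    for start, end in tokens:
        tag = "O"
        for e_start, e_end in entities:
            if start >= e_start and end <= e_end:
                tag = "B" if start == e_start else "I"
                break
        tags.append(tag)
    return tags


def token_features(text: str,
                   tool_outputs: Dict[str, List[Span]]) -> List[Dict[str, str]]:
    """Build one feature dict per token: the lowercased surface form plus
    each (hypothetical) tool's tag for that token."""
    tokens = fine_tokenize(text)
    per_tool = {name: bio_tags(tokens, spans)
                for name, spans in tool_outputs.items()}
    features = []
    for i, (start, end) in enumerate(tokens):
        feats = {"word": text[start:end].lower()}
        for name, tags in per_tool.items():
            feats["tool=" + name] = tags[i]
        features.append(feats)
    return features


if __name__ == "__main__":
    text = "2-propanol and NaCl"
    # Hypothetical outputs of two tools that follow different guidelines.
    outputs = {"tool_a": [(0, 10)], "tool_b": [(2, 10), (15, 19)]}
    for feats in token_features(text, outputs):
        print(feats)
```

In this toy input the two tools disagree on where "2-propanol" starts and on whether "NaCl" is an entity. Because each tool's tag enters as a separate feature, a sequence model trained on a corpus annotated under the target guideline can learn how far to trust each tool in each context, which is the intuition behind relying on consistent patterns of success and failure.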

Keywords: Chemical named-entity recognition, Ensemble learning, Conditional random field, Text tokenization
