A hidden Markov model‐based approach for extracting information from web news
International Journal of Web Information Systems
ISSN: 1744-0084
Article publication date: 28 September 2007
Abstract
Purpose
This paper aims to present a method based on hidden Markov models (HMM) for extracting information from web news.
Design/methodology/approach
The samples under study are drawn from the PROC “People's Daily Online,” a web‐based news publication whose archives are unstructured. The study develops HMM‐based tools for filtering news in order to retrieve terms of interest, such as “Geo‐location,” “System,” and “Personas.” The experiments are performed in two stages. In the first stage, each HMM is built to extract a single target term, in order to evaluate its basic information extraction (IE) capability. In the second stage, the experiment is extended to the more complex problem of multi‐term extraction.
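As an illustration of the first-stage setup, in which one HMM targets one term, the following is a minimal sketch and not the paper's implementation. The two states, the toy vocabulary, and every probability are hypothetical values chosen for the example; the abstract does not give the trained parameters or the model topology. Decoding uses the standard Viterbi algorithm in log space.

```python
import math

# Hypothetical two-state tagger: "TARGET" marks tokens of the term of interest,
# "OTHER" marks everything else. All probabilities are invented for illustration.
STATES = ["OTHER", "TARGET"]
START = {"OTHER": 0.8, "TARGET": 0.2}
TRANS = {
    "OTHER": {"OTHER": 0.8, "TARGET": 0.2},
    "TARGET": {"OTHER": 0.6, "TARGET": 0.4},
}
EMIT = {
    "OTHER": {"visited": 0.3, "in": 0.3, "officials": 0.3},
    "TARGET": {"beijing": 0.5, "shanghai": 0.4},
}
UNSEEN = 1e-6  # floor probability for words a state has no emission entry for


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log-space Viterbi)."""
    if not tokens:
        return []

    def emit(state, word):
        return math.log(EMIT[state].get(word.lower(), UNSEEN))

    # best[t][s] = log-probability of the best path ending in state s at step t
    best = [{s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}]
    back = []
    for word in tokens[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def extract(tokens):
    """Keep only the tokens decoded as TARGET."""
    return [w for w, s in zip(tokens, viterbi(tokens)) if s == "TARGET"]
```

Calling `extract(["Officials", "visited", "Beijing"])` keeps only the token decoded as the target state. A multi-term variant, as in the second stage, would presumably need one state (or group of states) per term type; that extension is not sketched here.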
Findings
The results show that, with HMMs as a basis, the accuracy (F‐measure) of single‐term IE tasks exceeds 70 per cent on average, while multi‐term extraction achieves an accuracy of no less than 66 per cent.
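For reference, the F‐measure combines precision and recall into one score. The sketch below assumes the balanced form (beta = 1), which the abstract does not confirm, and the function name and the sample term lists are hypothetical.

```python
def f_measure(extracted, relevant, beta=1.0):
    """F-measure of a set of extracted terms against the set of relevant terms."""
    extracted, relevant = set(extracted), set(relevant)
    true_pos = len(extracted & relevant)
    if true_pos == 0:
        return 0.0  # no correct extractions: precision and recall are both zero
    precision = true_pos / len(extracted)  # share of extracted terms that are correct
    recall = true_pos / len(relevant)      # share of relevant terms that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented example: 3 terms extracted, 4 terms relevant, 2 in common.
score = f_measure(
    ["Beijing", "Shanghai", "Daily"],
    ["Beijing", "Shanghai", "Tianjin", "Hong Kong"],
)
```

Here precision is 2/3 and recall is 1/2, so the balanced F‐measure is 4/7, or roughly 57 per cent.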
Originality/value
The study demonstrates the promise of HMMs for developing automatic tools to filter free‐structured data.
Citation
Tso, B. (2007), "A hidden Markov model‐based approach for extracting information from web news", International Journal of Web Information Systems, Vol. 3 No. 1/2, pp. 104-115. https://doi.org/10.1108/17440080710829243
Publisher
Emerald Group Publishing Limited
Copyright © 2007, Emerald Group Publishing Limited