Most text mining methods are based on representing documents
using a vector space model, commonly known as
a bag-of-words model, where each document is modeled as
a linear vector representing the occurrence of independent
words in the text corpus. It is well known that using this
vector-based representation, important information, such
as the semantic relationships among concepts, is lost. This paper
proposes a novel text representation model called the ConceptLink
graph. The ConceptLink graph not only
represents the content of a document, but also captures
some of its underlying semantic structure in terms of the
relationships among concepts. The ConceptLink graph
is constructed in two main stages. First, we find a set
of concepts by clustering conceptually related terms using
the self-organizing map method. Second, by mapping
each document's content to concepts, we generate a
graph of concepts based on the occurrences of concepts
using a singular value decomposition technique. The ConceptLink
graph will overcome the keyword independence
limitation of the vector space model and take advantage
of the implicit concept relationships exhibited in all natural
language texts. As an information-rich text representation
model, the ConceptLink graph will advance text mining
technology beyond feature-based to structure-based
knowledge discovery. We will illustrate the ConceptLink
graph method using samples generated from a benchmark
text mining dataset.
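
The keyword independence limitation mentioned above can be illustrated with a minimal sketch. It is not the ConceptLink method itself: the term-to-concept map `CONCEPT_OF` is a hypothetical, hand-written stand-in for the clusters that the abstract says are found with a self-organizing map, and the example documents and function names are assumptions made for illustration only. The sketch shows that two documents using different words for the same concepts look unrelated under a bag-of-words model, while a concept-level representation recovers their similarity.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order and word relations are discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical term-to-concept map, written by hand for this example.
# In the abstract, concepts come from clustering related terms instead.
CONCEPT_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "engine": "motor",
    "motor": "motor",
}


def bag_of_concepts(text):
    """Count concept occurrences after mapping each word to its concept."""
    return Counter(CONCEPT_OF.get(w, w) for w in text.lower().split())


doc_a = "car engine"
doc_b = "automobile motor"

# No shared keywords, so the word-level similarity is zero.
word_sim = cosine(bag_of_words(doc_a), bag_of_words(doc_b))

# Both documents map to the same concepts, so the similarity is close to one.
concept_sim = cosine(bag_of_concepts(doc_a), bag_of_concepts(doc_b))
```

Mapping words to concepts is only the first step described in the abstract; the graph of relationships among concepts, which the authors derive with a singular value decomposition technique, is not covered by this sketch.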
Cite as: Chau, R., Tsoi, A.C., Hagenbuchner, M. and Lee, V. (2009). A ConceptLink Graph for Text Structure Mining. In Proc. Thirty-Second Australasian Computer Science Conference (ACSC 2009), Wellington, New Zealand. CRPIT, 91. Mans, B., Ed. ACS. 129-137.