TEXT DOCUMENT INFORMATION RETRIEVAL BASED ON CONCEPTS
ABSTRACT: The huge volume of
digital information collected automatically by internet technology has caused
problems in information retrieval.
Finding the right information in a large collection is very difficult. In most
search engines the difficulty is caused by string-matching algorithms that
return a match whenever an exact occurrence of the search term is found. To
address this problem, and considering that a document collection is not only a
collection of words but also a collection of concepts, we propose a new
information retrieval technique that is based on concepts.
The word-based and concept-based techniques differ in indexing and retrieval.
During indexing, the concept-based technique classifies documents into concepts
extracted from the collection via a clustering technique, constructing a
concept index in addition to the term index. During retrieval, the technique
ranks documents based on a combination of term and conceptual similarity,
formulated as doc-score = β * conceptScore + (1-β) * TermScore, where β is the
weight of the concept score. The clustering algorithm is chosen from the
partitional model, which is linear in complexity: Bisecting K-Means.
Two kinds of test collections were used in the experiments: news text documents
(collections of 1000 and 3000 news documents) and academic articles (1000
academic abstracts in information technology). Performance was measured using
average precision and R-precision.
The results showed that setting β between 0.5 and 0.9 significantly improved
the precision of the concept-based approach over the word-based-only approach
(β = 0). The improvements were about 5.2% to 8.3% for average precision and
16.9% to 31.5% for R-precision.
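
The ranking formula and the two evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the document ids, the scores and the relevance judgments are all made up for the example, and average precision is assumed here to be the non-interpolated variant (mean of the precision values at the rank of each relevant document), which the abstract does not specify.

```python
from typing import Dict, List, Set


def doc_score(concept_score: float, term_score: float, beta: float) -> float:
    """doc-score = beta * conceptScore + (1 - beta) * TermScore."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * concept_score + (1.0 - beta) * term_score


def rank(concept_scores: Dict[str, float],
         term_scores: Dict[str, float],
         beta: float) -> List[str]:
    """Return document ids sorted by descending combined score."""
    doc_ids = set(concept_scores) | set(term_scores)
    combined = {
        d: doc_score(concept_scores.get(d, 0.0), term_scores.get(d, 0.0), beta)
        for d in doc_ids
    }
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(combined, key=lambda d: (-combined[d], d))


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision (an assumption, see lead-in)."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision within the top R results, R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r


# Hypothetical toy data, invented purely to exercise the functions.
TERM_SCORES = {"d1": 0.9, "d2": 0.4, "d3": 0.2, "d4": 0.6}
CONCEPT_SCORES = {"d1": 0.1, "d2": 0.8, "d3": 0.7, "d4": 0.2}
RELEVANT = {"d2", "d3"}

if __name__ == "__main__":
    for beta in (0.0, 0.5):
        ranking = rank(CONCEPT_SCORES, TERM_SCORES, beta)
        print(beta, ranking,
              average_precision(ranking, RELEVANT),
              r_precision(ranking, RELEVANT))
```

With β = 0 the ranking reduces to the term score alone, which corresponds to the word-based baseline mentioned in the abstract. The toy numbers only show the mechanics and say nothing about the reported results.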
Author: Amir Hamzah
Journal Code: jptinformatikagg110003