In this thesis, we investigate information retrieval
techniques for Indonesian.
Stemming is the process of reducing morphological
variants of a word to a
common stem form.
Although several stemming algorithms have been
proposed for Indonesian,
there is no consensus on which performs best.
We empirically explore these stemming algorithms,
propose novel extensions to the best algorithm,
develop a new Indonesian stemmer, and show that
these can improve stemming correctness.
We propose a range of techniques to enhance the
performance of Indonesian information retrieval.
Our experiments show that many of these techniques
can increase retrieval performance.
We also address the problem of automatically creating
parallel corpora, which are essential for
cross-lingual information retrieval and other
natural language processing tasks, including machine
translation.
We describe algorithms that we have developed to
automatically identify parallel documents for
Indonesian and English.
We also investigate the applicability of our
identification algorithms
to other languages that use the Latin alphabet,
including German and French.