Abstract:
An under-explored question in cross-language information retrieval
(CLIR) is to what degree the performance of CLIR methods depends on
the availability of high-quality translation resources for particular
domains. To address this issue, we evaluate several competitive CLIR
methods, each trained on different corpora, on test documents in the
medical domain. Our results show severe performance degradation when
using a general-purpose training corpus or a commercial machine
translation system (SYSTRAN), versus a domain-specific training
corpus. A related unexplored question is whether we can improve CLIR
performance by systematically analyzing training resources and
optimally matching them to target collections. We start exploring this
problem by suggesting a simple criterion for automatically matching
training resources to target corpora. Using the cosine similarity
between training and target corpora as resource weights, we obtained an
average improvement of 5.6% over using all resources with no
weights. The same metric yields 99.4% of the performance obtained when
an oracle chooses the optimal resource every time.
This work will also be presented at SIGIR 2004 and is joint work with Yiming Yang.
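
The resource-matching criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how corpora are represented or how the similarities are turned into weights, so everything below is an assumption made for illustration: corpora are reduced to raw term-count vectors with whitespace tokenization, and the cosine similarities are normalized to sum to one. The names term_vector, cosine, and resource_weights are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def term_vector(docs):
    """Build a bag-of-words count vector for a corpus (assumed representation)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def resource_weights(resources, target_docs):
    """Weight each training resource by its similarity to the target corpus.

    Normalizing the similarities to sum to 1 is an assumption of this sketch;
    if no resource overlaps the target, it falls back to uniform weights.
    """
    target = term_vector(target_docs)
    sims = {name: cosine(term_vector(docs), target)
            for name, docs in resources.items()}
    total = sum(sims.values())
    if total == 0:
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}


# Toy data, invented purely for illustration.
resources = {
    "medical": ["patient dose therapy", "clinical trial patient"],
    "news": ["election vote parliament"],
}
target = ["patient therapy outcome"]
weights = resource_weights(resources, target)
```

In this toy setup the medical resource shares vocabulary with the target collection and the news resource does not, so the medical resource receives all of the weight. An unweighted baseline would correspond to giving every resource the same weight.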