Generalized Example-Based Machine Translation
4. Effectiveness of Generalization
Graphs 1 and 2 show how coverage increases as more text is added to
the example base. Coverage means the percentage of the words in the
input for which the EBMT system is able to generate at least one
candidate translation. The three conditions compared in each graph
are:
- example base indexed without tokenization (other than numbers)
- example base indexed using only the tokenization file for replacements
- example base indexed using full recursive matching with both tagged
entries and the tokenization file
For the French-English system using full recursive matching, there are
some 76,000 morphological entries and 550 grammar rules, for a total
of 224,000 words of linguistic data that must be counted toward the
example base; this is why the top curve starts well to the right of
the other curves. The Spanish-English system has much less linguistic
data: some 13,000 morphological entries and 450 grammar rules, for a
total of 43,000 words of overhead. For both systems, the left-most point
on the top curve represents the performance with only the linguistic
data and no actual example sentences.
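The coverage measure defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EBMT system's own code: the function name `coverage` and the representation of candidate translations as half-open word-index spans are hypothetical, and the sample spans are made up.

```python
def coverage(num_words, matches):
    """Percentage of input words for which at least one candidate
    translation exists.

    num_words: number of words in the input.
    matches:   iterable of (start, end) half-open word-index spans, one per
               candidate translation (a hypothetical representation).
    """
    if num_words == 0:
        return 0.0
    covered = [False] * num_words
    for start, end in matches:
        # Clip each span to the input so stray indices cannot raise errors.
        for i in range(max(start, 0), min(end, num_words)):
            covered[i] = True
    return 100.0 * sum(covered) / num_words


# Made-up example: a 10-word input with three overlapping candidate spans.
# Words 0-4 and word 7 are covered, so 6 of the 10 words count.
print(coverage(10, [(0, 3), (2, 5), (7, 8)]))  # 60.0
```

Overlapping candidates count each word only once, which matches the "at least one candidate translation" wording of the definition.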
Graph 1: French coverage
Graph 2: Spanish coverage
Coverage is not the only measure of the system's performance. Another
important measure is the average match length, that is, the size of the
pieces for which the system generates translations.
Provided that other parameters are not changed, a larger match will
(in general) be of higher quality because it takes more context into
account -- and is thus less likely to pick an incorrect word sense,
make an egregious alignment error, etc. This approximate quality
measure is important because generating manual judgements of
translation quality is a tedious, time-consuming, and expensive task.
Graphs 3 and 4 show how the average match length increases as more
examples are added to the system, with the three curves representing
the same conditions as in Graphs 1 and 2.
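As a rough sketch, the average match length could be computed as below. The text does not give the exact averaging scheme, so this assumes a plain mean over the lengths (in words) of all matched pieces; the function name `average_match_length` and the span representation are hypothetical.

```python
def average_match_length(matches):
    """Mean length, in words, of the matched pieces.

    matches: iterable of (start, end) half-open word-index spans.
    Assumption: a simple unweighted mean over all matches; the actual
    system may weight or filter matches differently.
    """
    lengths = [end - start for start, end in matches if end > start]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)


# Made-up example: one four-word match and one two-word match.
print(average_match_length([(0, 4), (4, 6)]))  # 3.0
```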
Graph 3: French match length
Graph 4: Spanish match length
As can be seen from these graphs, simple tokenization adds a few
percent to the coverage and average match length, while full recursive
matching substantially increases both once the initial overhead of
morphological entries and grammar rules has been accounted for. A
more important measure in practice, however, is how much text is
required to reach a certain coverage of unrestricted texts; here,
recursive matching with a grammar makes a dramatic difference.
Achieving 80% coverage of French inputs requires about 1.4 million words
of text (French + English) without generalizations, but less than
280,000 words with full recursive matching -- a factor of five, and
most of the text in the latter case is linguistic information rather
than example sentences. Because the performance curve flattens out,
the benefit increases as the coverage level goes up, reaching a
reduction by a factor of eleven at 90% coverage. The Spanish system
shows a similar reduction in the required amount of text, though a
somewhat less pronounced one because it has less linguistic knowledge.
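The French figures at 80% coverage can be checked with a little arithmetic. The sketch below uses only numbers quoted in this section; the variable names are illustrative. Because the text says "less than 280,000 words", the results are bounds rather than exact values, and the calculation assumes the 224,000 words of linguistic data are counted in the same units as the example text.

```python
# Figures quoted in the text for the French-English system at 80% coverage.
words_without_generalization = 1_400_000  # French + English, no generalization
words_with_recursive_matching = 280_000   # upper bound ("less than")
linguistic_overhead_words = 224_000       # morphological entries + grammar rules

# Reduction in required text: at least a factor of five.
reduction_factor = words_without_generalization / words_with_recursive_matching

# Share of the 280,000 words that is linguistic data: at least 80%.
overhead_share = linguistic_overhead_words / words_with_recursive_matching

# Words of actual example sentences: at most 56,000.
example_words = words_with_recursive_matching - linguistic_overhead_words

print(reduction_factor, overhead_share, example_words)
```

This makes concrete the remark that most of the text in the recursive-matching case is linguistic information rather than example sentences.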
(Last updated 04-Aug-99)