Our main goal is to find a method that automatically obtains the best feature selection from the training data (Veenstra et al. 2000; Mihalcea 2002; Suárez and Palomar 2002). We performed an $n$-fold cross-validation process: the data is divided into $n$ folds, and $n$ tests are run, each one using $n-1$ folds as training data and the remaining fold as testing data. The final result is the average accuracy over the $n$ tests. We decided on just three tests ($n=3$) because of the small size of the training data. We then tested several combinations of features over the training data of the SENSEVAL-2 Spanish lexical-sample task and analyzed the results obtained for each word.
In order to perform the 3-fold cross-validation process on each word, some preprocessing of the corpus was done. For each word, all senses were uniformly distributed into the three folds (each fold contains one-third of the examples of each sense). Those senses that had fewer than three examples in the original corpus file were rejected and not processed.
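As an illustration, the following Python sketch reproduces this preprocessing and evaluation loop. The train_classifier argument is only a placeholder for the ME learner (it is assumed to return a classification function) and is not part of the actual implementation.

import random
from collections import defaultdict

def stratified_folds(examples, k=3, seed=0):
    """Distribute the examples of each sense uniformly across k folds;
    senses with fewer than k examples are rejected, as described above."""
    by_sense = defaultdict(list)
    for context, sense in examples:
        by_sense[sense].append((context, sense))
    folds = [[] for _ in range(k)]
    rng = random.Random(seed)
    for sense, items in by_sense.items():
        if len(items) < k:             # sense rejected: too few examples
            continue
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % k].append(item)  # one k-th of each sense per fold
    return folds

def cross_validate(examples, train_classifier, k=3):
    """Run k tests: each fold serves once as testing data and the rest as
    training data; the final result is the average accuracy."""
    folds = stratified_folds(examples, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classify = train_classifier(train)   # placeholder for the ME learner
        correct = sum(1 for context, sense in test if classify(context) == sense)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)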
Table 8 shows the best results obtained using
three-fold cross-validation on the training data. Several feature combinations were tested in order
to find the best set for each selected word. The purpose was to
obtain the most relevant information for each word from the corpus
rather than applying the same combination of features to all of
them. Therefore, the Features column lists only the feature selection that achieved the best result; the string in each row represents the entire set of features used when training that word's classifier. For example, autoridad obtains its best result using nearest words, collocations of two lemmas, collocations of two words, and POS information (see Figure 9 for the definitions of the corresponding features).
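The per-word search itself amounts to scoring every candidate combination of feature types with the cross-validation routine sketched above and keeping the winner. The following sketch is illustrative only: the labels in FEATURE_TYPES and the evaluate function (assumed to train the ME classifier with a given selection and return its averaged three-fold accuracy) are placeholders, not the labels of Figure 9.

from itertools import combinations

# Illustrative placeholders for the feature types discussed in the text.
FEATURE_TYPES = ["target_word", "nearest_words", "lemma_collocations",
                 "word_collocations", "pos", "keywords"]

def best_feature_selection(word_examples, evaluate):
    """Try every non-empty combination of feature types for one word and
    return the one with the highest three-fold cross-validation accuracy."""
    best_selection, best_accuracy = None, -1.0
    for size in range(1, len(FEATURE_TYPES) + 1):
        for selection in combinations(FEATURE_TYPES, size):
            accuracy = evaluate(word_examples, selection)
            if accuracy > best_accuracy:
                best_selection, best_accuracy = selection, accuracy
    return best_selection, best_accuracy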
The column Accur (for ``accuracy'') shows the number of correctly classified contexts divided by the total number of contexts; because the ME method assigns a sense to every context, precision equals recall. The column MFS shows the accuracy obtained when the most frequent sense is always selected.
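For reference, both columns reduce to simple counts. A minimal sketch, assuming the gold and predicted senses are given as parallel lists and that the most frequent sense is estimated from the training portion:

from collections import Counter

def accuracy(predicted, gold):
    """Correctly classified contexts divided by the total number of contexts;
    since every context receives an answer, precision equals recall."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def mfs_baseline(train_senses, test_senses):
    """Accuracy obtained by always selecting the most frequent training sense."""
    most_frequent = Counter(train_senses).most_common(1)[0][0]
    return accuracy([most_frequent] * len(test_senses), test_senses)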
The data summarized in Table 8 reveal that using ``collapsed'' features in the ME method is useful; both ``collapsed'' and ``non-collapsed'' features are used, even for the same word. For example, the adjective vital obtains its best result with the ``collapsed'' versions of words in a window around the target, collocations of two lemmas and of two words, and POS labels; we can infer from this that single-word information is less important than collocations for disambiguating vital correctly.
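The ``collapsed'' features are not defined at this point in the text, so the following sketch only illustrates one plausible reading of the distinction: a ``non-collapsed'' word feature records the word together with its position in the window, whereas its ``collapsed'' counterpart discards the position so that a single function fires for the word anywhere in the window. Both functions and the window size are illustrative assumptions, not the definitions used in this article.

def non_collapsed_word_features(tokens, target_index, window=3):
    """One feature per position, e.g. 'w[-1]=banco' (position kept)."""
    features = []
    for offset in range(-window, window + 1):
        position = target_index + offset
        if offset != 0 and 0 <= position < len(tokens):
            features.append(f"w[{offset}]={tokens[position]}")
    return features

def collapsed_word_features(tokens, target_index, window=3):
    """Positions merged into a single bag, e.g. 'w=banco' (position dropped)."""
    features = set()
    for offset in range(-window, window + 1):
        position = target_index + offset
        if offset != 0 and 0 <= position < len(tokens):
            features.add(f"w={tokens[position]}")
    return sorted(features)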
The target word itself (feature 0) is useful for nouns, verbs, and adjectives, but many of the words do not include it in their best feature selection; in general, these words do not show a relevant relationship between word shape and senses. On the other hand, POS information is selected less often. When analogous pairs of feature types are compared, they turn out to be complementary in the majority of cases. Grammatical relationships and word-word dependencies also seem very useful when combined with other types of attributes. Moreover, keywords ($m$ features) are used very often, possibly due to the source and size of the contexts in the SENSEVAL-2 Spanish lexical-sample data.
Table 9 shows the best feature selections for each part of speech and for all words. The data presented in Tables 8 and 9 were used to build four different sets of classifiers in order to compare their accuracy: MEfix uses the overall best feature selection for all words; MEbfs trains each word with its own best selection of features (Table 8); MEbfs.pos uses the best selection per POS for all nouns, verbs, and adjectives, respectively (Table 9); and, finally, vME is a majority voting system that takes the answers of the preceding systems as input.
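A minimal sketch of how the four systems could be assembled, assuming that train_me(examples, selection) returns a trained ME classifier and that the per-word and per-POS selections are stored in dictionaries; the function names and the tie-breaking rule of the voting system are assumptions made only for illustration.

from collections import Counter

def majority_vote(answers, fallback_index=2):
    """vME: majority voting over the three answers; when all three systems
    disagree, fall back to the MEbfs.pos answer (an assumed tie-breaking
    rule; the text does not specify one)."""
    counts = Counter(answers).most_common()
    if counts[0][1] == 1:                  # full disagreement
        return answers[fallback_index]
    return counts[0][0]

def classify_word(word, pos, train_examples, test_contexts, train_me,
                  overall_best, best_per_word, best_per_pos):
    """Answers of MEfix, MEbfs, MEbfs.pos, and vME for one word."""
    me_fix = train_me(train_examples, overall_best)           # same selection for all words
    me_bfs = train_me(train_examples, best_per_word[word])    # per-word selection (Table 8)
    me_bfs_pos = train_me(train_examples, best_per_pos[pos])  # per-POS selection (Table 9)
    results = []
    for context in test_contexts:
        answers = [me_fix(context), me_bfs(context), me_bfs_pos(context)]
        results.append(answers + [majority_vote(answers)])
    return results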
Table 10 shows a comparison of the four systems. MEfix has the lowest results; this classifier applies the same set of feature types to all words. However, the per-word best feature selection (MEbfs) does not yield the best results either, probably because more training examples would be necessary. The best choice seems to be a fixed set of feature types for each POS (MEbfs.pos).
While MEbfs predicts, for each word and from the training data alone, which individually selected features should be the best ones when evaluated on the testing data, MEbfs.pos is an averaged prediction: a selection of features that, over the training data, performed a ``good enough'' disambiguation of the majority of words belonging to a particular POS. When this averaged prediction is applied to the real testing data, MEbfs.pos performs better than MEbfs.
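Under this reading, the per-POS selection can be obtained by choosing, for every POS, the combination whose accuracy averaged over the training data of all words of that POS is highest. The following is a sketch of one possible formulation, reusing the hypothetical evaluate function from above; it is not the exact procedure used to build Table 9.

from collections import defaultdict
from statistics import mean

def best_selection_per_pos(words, evaluate, candidate_selections):
    """For each POS, pick the feature selection with the best accuracy
    averaged over the training data of all words of that POS.
    `words` maps a word to a (pos, examples) pair."""
    by_pos = defaultdict(list)
    for word, (pos, examples) in words.items():
        by_pos[pos].append(examples)
    best = {}
    for pos, example_sets in by_pos.items():
        best[pos] = max(
            candidate_selections,
            key=lambda selection: mean(evaluate(examples, selection)
                                       for examples in example_sets),
        )
    return best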
Another important issue is that MEbfs.pos obtains an accuracy slightly better than the best possible evaluation result achieved with ME (see Table 7)--that is, a best-feature-selection per POS strategy from training data guarantees an improvement on ME-based WSD.
In general, verbs are difficult to learn, and the accuracy of the method for them is lower than for the other POS; in our opinion, more information (perhaps knowledge based) is needed to build their classifiers. In this case, the voting system (vME), which is based on the agreement between the other three systems, does not improve accuracy.
Finally, in Table 11 the results of the ME method are compared with those of the systems that competed at SENSEVAL-2 in the Spanish lexical-sample task. The results obtained by the ME systems are excellent for nouns and adjectives, but not for verbs. However, when all POS are compared together, the ME systems perform comparably to the best SENSEVAL-2 systems.