Our main goal is to find a method that automatically obtains the best feature selection from the training data (Veenstra et al. 2000; Mihalcea 2002; Suárez and Palomar 2002). We performed an n-fold cross-validation process: the data are divided into n folds, and n tests are run, each using n-1 folds as training data and the remaining fold as testing data. The final result is the average accuracy. We decided on just three tests because of the small size of the training data. We then tested several combinations of features over the training data of the SENSEVAL-2 Spanish lexical-sample task and analyzed the results obtained for each word.
In order to perform the 3-fold cross-validation process on each word, some preprocessing of the corpus was done. For each word, all senses were uniformly distributed into the three folds (each fold contains one-third of the examples of each sense). Those senses that had fewer than three examples in the original corpus file were rejected and not processed.
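The per-word fold construction just described can be sketched as follows (a minimal Python sketch, not the paper's actual code; the round-robin assignment is one simple way of placing one-third of each sense's examples in every fold):

```python
from collections import defaultdict

def stratified_folds(examples, n_folds=3):
    """Distribute the examples of each sense uniformly across the folds.

    `examples` is a list of (context, sense) pairs. Senses with fewer
    examples than folds are rejected, as in the preprocessing step above.
    (Hypothetical helper, written only for illustration.)
    """
    by_sense = defaultdict(list)
    for context, sense in examples:
        by_sense[sense].append((context, sense))

    folds = [[] for _ in range(n_folds)]
    for sense, items in by_sense.items():
        if len(items) < n_folds:
            continue                        # fewer than three examples: reject sense
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)  # round-robin: ~1/3 of each sense per fold
    return folds
```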
Table 8 shows the best results obtained using three-fold cross-validation on the training data. Several feature combinations were tested in order to find the best set for each selected word. The purpose was to obtain the most relevant information for each word from the corpus rather than applying the same combination of features to all of them. Therefore, the Features column lists only the feature selection with the best result; the string in each row represents the entire set of features used when training the classifier for that word. For example, autoridad obtains its best result using nearest words, collocations of two lemmas, collocations of two words, and POS information (see Figure 9). The Accur column (for ``accuracy'') shows the number of correctly classified contexts divided by the total number of contexts (because the ME method classifies every context, precision equals recall). The MFS column shows the accuracy obtained when the most frequent sense is always selected.
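The search for each word's best feature selection can be sketched like this (an illustrative Python sketch, not the paper's implementation; `train_eval` is an assumed callback that trains an ME classifier with the given feature types and returns the accuracy averaged over the folds):

```python
from itertools import combinations

def best_feature_selection(word_examples, feature_types, train_eval, n_folds=3):
    """Try every combination of feature types and keep the one with the
    highest average cross-validation accuracy on this word's examples.
    (Hypothetical helper; exhaustive search is one straightforward way
    to realize the per-word selection described in the text.)
    """
    best_acc, best_combo = -1.0, None
    for r in range(1, len(feature_types) + 1):
        for combo in combinations(feature_types, r):
            acc = train_eval(combo, word_examples, n_folds)
            if acc > best_acc:
                best_acc, best_combo = acc, combo
    return best_combo, best_acc
```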
The data summarized in Table 8 reveal that using ``collapsed'' features in the ME method is useful; both ``collapsed'' and ``non-collapsed'' functions are used, sometimes even for the same word. For example, the adjective vital obtains its best result with the ``collapsed'' version of words in a window, collocations of two lemmas and two words in a window, and POS labels in a window; we can infer from this that single-word information is less important than collocations for disambiguating vital correctly.
The target word (feature 0) is useful for nouns, verbs, and adjectives, but many of the words do not use it in their best feature selection; in general, these words do not have a relevant relationship between word shape and senses. On the other hand, POS information is selected less often. When the ``collapsed'' and ``non-collapsed'' versions of the same features are compared, they prove complementary in the majority of cases. Grammatical relationships and word-word dependencies also seem very useful when combined with other types of attributes. Moreover, keywords (m features) are used very often, possibly owing to the source and size of the contexts in the SENSEVAL-2 Spanish lexical-sample data.
Table 9 shows the best feature selections for each part-of-speech and for all words. The data presented in Tables 8 and 9 were used to build four different sets of classifiers in order to compare their accuracy: MEfix uses the overall best feature selection for all words; MEbfs trains each word with its best selection of features (in Table 8); MEbfs.pos uses the best selection per POS for all nouns, verbs and adjectives, respectively (in Table 9); and, finally, vME is a majority voting system that has as input the answers of the preceding systems.
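The vME combination described above amounts to a simple majority vote over the three systems' answers for each context; a minimal sketch (the tie-breaking rule is an assumption, since the text does not specify one):

```python
from collections import Counter

def vote(answers):
    """Majority vote over the answers of MEfix, MEbfs, and MEbfs.pos
    for one context. Ties are broken in favor of the first-listed
    system (an assumed convention, for illustration only).
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:          # first listed system wins ties
        if counts[answer] == top:
            return answer
```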
Table 10 shows a comparison of the four systems. MEfix has the lowest results: this classifier applies the same set of feature types to all words. However, the best feature selection per word (MEbfs) is not the best performer either, probably because more training examples would be necessary. The best choice seems to be selecting a fixed set of feature types for each POS (MEbfs.pos).
While MEbfs predicts, for each word over the training data, which individually selected features could be the best ones when evaluated on the testing data, MEbfs.pos is an averaged prediction, a selection of features that, over the training data, performed a ``good enough'' disambiguation of the majority of words belonging to a particular POS. When this averaged prediction is applied to the real testing data, MEbfs.pos performs better than MEbfs.
Another important point is that MEbfs.pos obtains an accuracy slightly better than the best possible evaluation result achieved with ME (see Table 7); that is, a best-feature-selection-per-POS strategy derived from the training data guarantees an improvement in ME-based WSD.
In general, verbs are difficult to learn, and the accuracy of the method is lower for them than for the other POS; in our opinion, more information (knowledge based, perhaps) is needed to build their classifiers. In this case, the voting system (vME), based on the agreement of the other three systems, does not improve accuracy.
Finally, in Table 11, the results of the ME method are compared with those of the systems that competed in the SENSEVAL-2 Spanish lexical-sample task. The results obtained by the ME systems are excellent for nouns and adjectives, but not for verbs. However, over all POS, the ME systems perform comparably to the best SENSEVAL-2 systems.