Our main goal is to find a method that automatically obtains the best feature selection from the training data (Veenstra et al. 2000; Mihalcea 2002; Suárez and Palomar 2002). We performed an n-fold cross-validation process: the data are divided into n folds, and n tests are run, each using n-1 folds as training data and the remaining fold as testing data. The final result is the average accuracy. We decided on just three tests because of the small size of the training data. We then tested several combinations of features over the training data of the SENSEVAL-2 Spanish lexical-sample task and analyzed the results obtained for each word.
In order to perform the 3-fold cross-validation process on each word, some preprocessing of the corpus was done. For each word, all senses were uniformly distributed into the three folds (each fold contains one-third of the examples of each sense). Those senses that had fewer than three examples in the original corpus file were rejected and not processed.
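The per-word fold construction just described can be sketched as follows (a minimal Python sketch, not the paper's actual code; the round-robin assignment is one simple way of placing one-third of each sense's examples in every fold):

```python
from collections import defaultdict

def stratified_folds(examples, n_folds=3):
    """Distribute the examples of each sense uniformly across the folds.

    `examples` is a list of (context, sense) pairs. Senses with fewer
    examples than folds are rejected, as in the preprocessing step above.
    (Hypothetical helper, written only for illustration.)
    """
    by_sense = defaultdict(list)
    for context, sense in examples:
        by_sense[sense].append((context, sense))

    folds = [[] for _ in range(n_folds)]
    for sense, items in by_sense.items():
        if len(items) < n_folds:
            continue                        # fewer than three examples: reject sense
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)  # round-robin: ~1/3 of each sense per fold
    return folds
```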
Table 8 shows the best results obtained using three-fold cross-validation on the training data. Several feature combinations were tested in order to find the best set for each selected word. The purpose was to obtain the most relevant information for each word from the corpus rather than applying the same combination of features to all of them. Therefore, the Features column lists only the feature selection with the best result; the string in each row represents the entire set of features used when training the classifier for that word. For example, autoridad obtains its best result using nearest words, collocations of two lemmas, collocations of two words, and POS information (see Figure 9). The Accur column (for ``accuracy'') shows the number of correctly classified contexts divided by the total number of contexts (because the ME method classifies every context, precision equals recall). The MFS column shows the accuracy obtained when the most frequent sense is always selected.
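The search for each word's best feature selection can be sketched like this (an illustrative Python sketch, not the paper's implementation; `train_eval` is an assumed callback that trains an ME classifier with the given feature types and returns the accuracy averaged over the folds):

```python
from itertools import combinations

def best_feature_selection(word_examples, feature_types, train_eval, n_folds=3):
    """Try every combination of feature types and keep the one with the
    highest average cross-validation accuracy on this word's examples.
    (Hypothetical helper; exhaustive search is one straightforward way
    to realize the per-word selection described in the text.)
    """
    best_acc, best_combo = -1.0, None
    for r in range(1, len(feature_types) + 1):
        for combo in combinations(feature_types, r):
            acc = train_eval(combo, word_examples, n_folds)
            if acc > best_acc:
                best_acc, best_combo = acc, combo
    return best_combo, best_acc
```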
The data summarized in Table 8 reveal that using ``collapsed'' features in the ME method is useful; both ``collapsed'' and ``non-collapsed'' functions are used, sometimes even for the same word. For example, the adjective vital obtains its best result with the ``collapsed'' version of words in a window, collocations of two lemmas and two words in a window, and POS labels in a window; we can infer from this that single-word information is less important than collocations for disambiguating vital correctly.
The target word (feature 0) is useful for nouns, verbs, and adjectives, but many of the words do not use it in their best feature selection; in general, these words do not have a relevant relationship between word shape and senses. On the other hand, POS information is selected less often. When the ``collapsed'' and ``non-collapsed'' versions of the same features are compared, they prove complementary in the majority of cases. Grammatical relationships and word-word dependencies also seem very useful when combined with other types of attributes. Moreover, keywords (m features) are used very often, possibly owing to the source and size of the contexts in the SENSEVAL-2 Spanish lexical-sample data.
Table 9 shows the best feature selections for each part-of-speech and for all words. The data presented in Tables 8 and 9 were used to build four different sets of classifiers in order to compare their accuracy: MEfix uses the overall best feature selection for all words; MEbfs trains each word with its best selection of features (in Table 8); MEbfs.pos uses the best selection per POS for all nouns, verbs and adjectives, respectively (in Table 9); and, finally, vME is a majority voting system that has as input the answers of the preceding systems.
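The vME combination described above amounts to a simple majority vote over the three systems' answers for each context; a minimal sketch (the tie-breaking rule is an assumption, since the text does not specify one):

```python
from collections import Counter

def vote(answers):
    """Majority vote over the answers of MEfix, MEbfs, and MEbfs.pos
    for one context. Ties are broken in favor of the first-listed
    system (an assumed convention, for illustration only).
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:          # first listed system wins ties
        if counts[answer] == top:
            return answer
```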
Table 10 shows a comparison of the four systems. MEfix has the lowest results: this classifier applies the same set of feature types to all words. However, the best feature selection per word (MEbfs) is not the best performer either, probably because more training examples would be necessary. The best choice seems to be selecting a fixed set of feature types for each POS (MEbfs.pos).
While MEbfs predicts, for each word over the training data, which individually selected features could be the best ones when evaluated on the testing data, MEbfs.pos is an averaged prediction, a selection of features that, over the training data, performed a ``good enough'' disambiguation of the majority of words belonging to a particular POS. When this averaged prediction is applied to the real testing data, MEbfs.pos performs better than MEbfs.
Another important point is that MEbfs.pos obtains an accuracy slightly better than the best possible evaluation result achieved with ME (see Table 7); that is, a best-feature-selection-per-POS strategy derived from the training data guarantees an improvement in ME-based WSD.
In general, verbs are difficult to learn, and the accuracy of the method is lower for them than for the other POS; in our opinion, more information (knowledge based, perhaps) is needed to build their classifiers. In this case, the voting system (vME), based on the agreement of the other three systems, does not improve accuracy.
Finally, in Table 11, the results of the ME method are compared with those of the systems that competed in the SENSEVAL-2 Spanish lexical-sample task. The results obtained by the ME systems are excellent for nouns and adjectives, but not for verbs. However, over all POS, the ME systems perform comparably to the best SENSEVAL-2 systems.