SENSEVAL-2 Spanish lexical-sample task

vME+SM extends vME by adding the SM classifier to the combination of the three ME systems in vME (see Section 3.3). The results on the Spanish lexical-sample task of SENSEVAL-2 are shown in Table 17. Because SM only works with nouns, vME+SM improves accuracy only for nouns, where it matches the score of JHU(R) (0.702). Its overall score (0.684) ranks second.


Table 17: vME+SM in the Spanish lexical-sample task of SENSEVAL-2

ALL                    Nouns
0.713  jhu(R)          0.702  jhu(R)
0.684  vME+SM          0.702  vME+SM
0.682  jhu             0.683  MEbfs.pos
0.677  MEbfs.pos       0.681  jhu
0.676  vME             0.678  vME
0.670  css244          0.661  MEbfs
0.667  MEbfs           0.652  css244
0.658  MEfix           0.646  MEfix
0.627  umd-sst         0.621  duluth 8
0.617  duluth 8        0.612  duluth Z
0.610  duluth 10       0.611  duluth 10
0.595  duluth Z        0.603  umd-sst
0.595  duluth 7        0.592  duluth 6
0.582  duluth 6        0.590  duluth 7
0.578  duluth X        0.586  duluth X
0.560  duluth 9        0.557  duluth 9
0.548  ua              0.514  duluth Y
0.524  duluth Y        0.464  ua


These results show that methods like SM and ME can be combined to achieve good disambiguation results. Our results are in line with those of Pedersen2002, which also presents a comparative evaluation of the systems that participated in the Spanish and English lexical-sample tasks of SENSEVAL-2. Its focus is on pairwise comparisons between systems to assess the degree to which they agree, and on measuring the difficulty of the test instances included in these tasks. If several systems are largely in agreement, there is little benefit in combining them, since they are redundant and will simply reinforce each other. However, if some systems disambiguate instances that others do not, the systems are complementary, and combining them may exploit the different strengths of each system to improve overall accuracy.

The results for nouns (the only part of speech to which SM applies), shown in Table 18, indicate that SM has a low level of agreement with all the other methods. However, the measure of optimal combination is quite high, reaching 89% (1.00 - 0.11) for the pairing of SM and JHU. In fact, all seven of the other methods achieved their highest optimal combination value when paired with the SM method.
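The arithmetic behind the optimal combination figure can be sketched as follows. This is a minimal illustration, assuming that optimal combination is simply one minus the proportion of instances that neither system answers correctly (the Zero OK column of Table 18). The function name `optimal_combination` and the dictionary layout are illustrative choices, not part of the original evaluation.

```python
# Zero OK values for each system paired with SM, copied from Table 18 (nouns only).
zero_ok = {
    "JHU": 0.11,
    "Duluth7": 0.12,
    "DuluthY": 0.12,
    "Duluth8": 0.13,
    "Cs224": 0.13,
    "Umcp": 0.14,
    "Duluth9": 0.16,
}


def optimal_combination(zero_ok_rate: float) -> float:
    """Upper bound on accuracy if a correct answer from either system
    could always be selected: every instance counts except those that
    both systems get wrong."""
    return 1.0 - zero_ok_rate


# Optimal combination for SM paired with each of the other systems.
optimal = {name: round(optimal_combination(z), 2) for name, z in zero_ok.items()}

for name, value in optimal.items():
    print(f"SM and {name}: {value:.2f}")
```

Under this reading, the pairing of SM and JHU gives 0.89, which is the 89% quoted in the text.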


Table 18: Optimal combination between the systems that participated in the Spanish lexical-sample task of SENSEVAL-2

System pair (nouns)   Both OK (1)   One OK (2)   Zero OK (3)   Kappa (4)
SM and JHU            0.29          0.32         0.11          0.06
SM and Duluth7        0.27          0.34         0.12          0.03
SM and DuluthY        0.25          0.35         0.12          0.01
SM and Duluth8        0.28          0.32         0.13          0.08
SM and Cs224          0.28          0.32         0.13          0.09
SM and Umcp           0.26          0.33         0.14          0.06
SM and Duluth9        0.26          0.31         0.16          0.14

1 Proportion of instances where both systems' answers are correct.
2 Proportion of instances where only one answer is correct.
3 Proportion of instances where neither answer is correct.
4 The kappa statistic (Cohen1960) is a measure of agreement between multiple systems (or judges) that is scaled by the agreement that would be expected just by chance. A value of 1.00 suggests complete agreement, while 0.00 indicates pure chance agreement.
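To make the four measures concrete, the sketch below computes them for a pair of systems from per-instance correctness flags. It assumes that agreement is measured on whether each answer is correct (a 2x2 correct/incorrect table) and uses the standard two-rater form of Cohen's kappa. The function `pair_stats` and the toy data are hypothetical and are not taken from the SENSEVAL-2 evaluation.

```python
from typing import Dict, Sequence


def pair_stats(a_ok: Sequence[bool], b_ok: Sequence[bool]) -> Dict[str, float]:
    """Both OK, One OK, Zero OK and Cohen's kappa for two systems,
    given one correctness flag per test instance for each system."""
    if len(a_ok) != len(b_ok) or len(a_ok) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a_ok)
    both = sum(1 for a, b in zip(a_ok, b_ok) if a and b) / n
    zero = sum(1 for a, b in zip(a_ok, b_ok) if not a and not b) / n
    one = 1.0 - both - zero

    # Observed agreement: both correct or both wrong.
    observed = both + zero
    # Chance agreement, from each system's own accuracy.
    acc_a = sum(1 for a in a_ok if a) / n
    acc_b = sum(1 for b in b_ok if b) / n
    expected = acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b)
    # If chance agreement is already total, kappa is undefined; report 1.0.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return {"both_ok": both, "one_ok": one, "zero_ok": zero, "kappa": kappa}


# Toy data (hypothetical): correctness of two systems on eight instances.
sys_a = [True, True, False, False, True, False, True, False]
sys_b = [True, False, True, False, True, True, False, False]

stats = pair_stats(sys_a, sys_b)
print(stats)  # both_ok 0.25, one_ok 0.5, zero_ok 0.25, kappa 0.0
```

In this toy case the two systems agree on half of the instances, which is exactly what chance predicts for two systems with 50% accuracy, so kappa is 0.00. At the same time only a quarter of the instances are missed by both, which is the pattern of low agreement and high complementarity that Table 18 reports for SM.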

This combination of circumstances suggests that SM, being a knowledge-based method, is fundamentally different from the other (i.e., corpus-based) methods, and is able to disambiguate a certain set of instances where the other methods fail. In fact, SM differs in that it is the only method that uses the structure of WordNet.