The full tagset of 37 is too large to estimate all models reliably, so
we investigated using smaller tagsets. To find the optimal tagset
size we tested a progression of tagset sizes starting from 37 down to
2. We used a greedy algorithm finding the best tag combination at
each stage. We found that a tagset size of 23 (formed by collapsing
the sub-categories of the four major categories in the original) gave
the best results. The results that follow compare the original tagset,
the set of size 23, and sets of size 3 and 2.
The set of size 2 only distinguishes words from punctuation, and the
set of size 3 distinguishes content words, function words and
punctuation. An n-gram length of 6 was used throughout (see below).
In general, our experiments showed that the optimal tagset size is between 15 and 25. Our standard tagset of 23 could be reduced slightly, with a small improvement, by combining rare tags (e.g. fw, foreign word) into the major categories.
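
The greedy reduction described above could be sketched roughly as follows. This is an illustrative sketch only, not the procedure actually used in the experiments: it assumes that each stage merges exactly one pair of tag groups, and that a caller-supplied function `score` (a hypothetical stand-in for whatever evaluation measure is used) returns a higher value for a better tagset. The tag names other than fw and the toy scoring function are invented for the example.

```python
from itertools import combinations


def greedy_tagset_reduction(tags, score, min_size=2):
    """Greedily merge tag groups, recording the best mapping at each size.

    tags:  iterable of original tag names.
    score: hypothetical callable taking {original_tag: group_label} and
           returning a number, where higher is assumed to be better.
    Returns {tagset_size: (mapping, score_value)}.
    """
    mapping = {t: t for t in tags}  # start with every tag in its own group
    results = {len(set(mapping.values())): (dict(mapping), score(mapping))}

    while len(set(mapping.values())) > min_size:
        groups = sorted(set(mapping.values()))
        best = None
        # Try every pairwise merge and keep the highest-scoring one.
        for a, b in combinations(groups, 2):
            candidate = {t: (a if g == b else g) for t, g in mapping.items()}
            s = score(candidate)
            if best is None or s > best[1]:
                best = (candidate, s)
        mapping = best[0]
        results[len(set(mapping.values()))] = (dict(mapping), best[1])

    return results


def toy_score(mapping):
    """Invented scoring rule for demonstration only: prefers three groups
    and heavily penalises merging punctuation with any word tag."""
    n_groups = len(set(mapping.values()))
    punct_group = mapping["."]
    mixed = any(g == punct_group for t, g in mapping.items() if t != ".")
    return -abs(n_groups - 3) - (10 if mixed else 0)


# Hypothetical five-tag starting set; only "fw" appears in the text above.
results = greedy_tagset_reduction(["nn1", "nn2", "vb", "fw", "."], toy_score)
best_size = max(results, key=lambda k: results[k][1])
print(best_size)  # 3
```

Because each stage commits to the single best merge, the search evaluates a number of candidates that is quadratic in the current tagset size at each stage rather than exploring every possible partition of the tags, at the cost of possibly missing a better combination that requires a locally worse merge first.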