Problem 1: Write a spell checker that loads a dictionary as a lexical tree.
    b--a--t
    |  |
    |  |--t--t--l--e
    |
    |--a--n--a--n--a

Note that although the "t" is common to both "bat" and "battle", we do NOT share the "t" between the two. This is because the "t" in "bat" is the last character of the word. By not sharing it, we ensure that each leaf of the tree uniquely identifies a word.
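The leaf rule above can be sketched in Python as follows. This is only an illustration under assumptions, not a required design: the names `Node`, `build_lextree` and `leaves` are hypothetical, the dictionary is assumed to contain distinct non-empty words, and a dummy root node is added so that words with different first letters can live in one tree. The sketch shares every non-final character with earlier words that have the same prefix, and always creates a fresh node for the final character.

```python
class Node:
    """One character of the lexical tree; `word` is set only on leaves."""

    def __init__(self, char):
        self.char = char
        # A list rather than a dict: two children may carry the same
        # character (the leaf "t" of "bat" and the inner "t" of "battle").
        self.children = []
        self.word = None


def build_lextree(words):
    """Build a lextree in which the last character of a word is never shared."""
    root = Node("*")  # dummy root (an assumption of this sketch)
    for word in words:
        node = root
        # Reuse an existing non-leaf child for every character but the last.
        for ch in word[:-1]:
            nxt = next(
                (c for c in node.children if c.char == ch and c.word is None),
                None,
            )
            if nxt is None:
                nxt = Node(ch)
                node.children.append(nxt)
            node = nxt
        # The final character always gets its own leaf.
        leaf = Node(word[-1])
        leaf.word = word
        node.children.append(leaf)
    return root


def leaves(node):
    """Return the words stored at the leaves below `node`."""
    if node.word is not None:
        return [node.word]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


root = build_lextree(["bat", "battle", "banana"])
print(sorted(leaves(root)))  # ['banana', 'bat', 'battle']
```

With this rule the number of leaves equals the number of dictionary words, so reaching a leaf identifies exactly one word.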
    *--b--a--t
       |  |
       |  |--t--t--l--e
       |
       |--a--n--a--n--a

To account for the "*" we will also prepend a "*" to the words in our test data. For instance, if we intend to spellcheck the word "brat", we first expand it to "*brat". Now both deletions and insertions in the first position can be accommodated.
    *--a
    |
    |--a--n
    |  |
    |  |--n--d
    |  |
    |  |--p--p--l--e
    |
    |--b--a--t
       |  |
       |  |--t--t--l--e
       |
       |--a--n--a--n--a

Note that "a" is not shared with "an", "and" or "apple". Similarly, the "n" in "an" is not shared with "and". This is to ensure that each leaf of the tree represents a unique word.
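One possible way to spellcheck a single word against such a tree is sketched below. It is an illustration under assumptions, not the required solution: the function names are hypothetical, insertions, deletions and substitutions all cost 1, and the whole tree is searched without any pruning. Each node is stored as a list `[char, children, word]`, and each node computes one row of the edit-distance table from its parent's row, so shared prefixes are scored only once.

```python
def build_lextree(words):
    """Each node is [char, children, word]; word is set only on leaves."""
    root = ["*", [], None]
    for w in words:
        node = root
        for ch in w[:-1]:
            # Only non-leaf nodes may be shared.
            nxt = next((c for c in node[1] if c[0] == ch and c[2] is None), None)
            if nxt is None:
                nxt = [ch, [], None]
                node[1].append(nxt)
            node = nxt
        node[1].append([w[-1], [], w])  # final character: always a new leaf
    return root


def spellcheck(root, text):
    """Return (closest dictionary word, edit distance) for `text`."""
    text = "*" + text  # align the input with the "*" root
    best_cost, best_word = float("inf"), None
    # Row for the empty template: j input characters cost j insertions.
    start = list(range(len(text) + 1))
    stack = [(root, start)]
    while stack:
        (ch, children, word), prev = stack.pop()
        row = [prev[0] + 1]  # template character against empty input
        for j in range(1, len(text) + 1):
            row.append(min(
                prev[j - 1] + (text[j - 1] != ch),  # match or substitution
                prev[j] + 1,                        # template char missing from input
                row[j - 1] + 1,                     # extra character in input
            ))
        if word is not None and row[-1] < best_cost:
            best_cost, best_word = row[-1], word
        for child in children:
            stack.append((child, row))
    return best_word, best_cost


tree = build_lextree(["a", "an", "and", "apple", "bat", "battle", "banana"])
print(spellcheck(tree, "brat"))  # ('bat', 1)
```

When several words are equally close, this sketch returns whichever one the traversal happens to reach first.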
Problem 2: Use the lextree structure to also automatically segment the text in this file and this file, finding word boundaries (and inserting spaces) in the right places and, for the second file, simultaneously spellchecking the words in it. To do this, permit a transition from the leaves of the lextree back to the "*". The location of "*" in the best path identifies word boundaries.
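The loop from the leaves back to "*" can be sketched as below. Several simplifications are assumed here and are not part of the assignment: the names are hypothetical, all edits cost 1, no beam pruning is applied, and the return to "*" is modelled by restarting the tree search at every text position with the cost accumulated so far, instead of by a single time-synchronous pass.

```python
def build_lextree(words):
    """Each node is [char, children, word]; word is set only on leaves."""
    root = ["*", [], None]
    for w in words:
        node = root
        for ch in w[:-1]:
            nxt = next((c for c in node[1] if c[0] == ch and c[2] is None), None)
            if nxt is None:
                nxt = [ch, [], None]
                node[1].append(nxt)
            node = nxt
        node[1].append([w[-1], [], w])
    return root


def segment(root, text):
    """Return (list of words, total edit cost) for unsegmented `text`."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost of text[:i]
    back = [None] * (n + 1)          # back[i]: (start, word) of the last word
    best[0] = 0
    for i in range(n):
        rest = text[i:]
        # Re-enter the tree at "*" carrying the cost of the path so far.
        start = [best[i] + j for j in range(len(rest) + 1)]
        stack = [(child, start) for child in root[1]]
        while stack:
            (ch, children, word), prev = stack.pop()
            row = [prev[0] + 1]
            for j in range(1, len(rest) + 1):
                row.append(min(
                    prev[j - 1] + (rest[j - 1] != ch),  # match or substitution
                    prev[j] + 1,                        # character missing from text
                    row[j - 1] + 1,                     # extra character in text
                ))
            if word is not None:
                # A leaf may hand control back to "*" at any later position.
                for j in range(1, len(rest) + 1):
                    if row[j] < best[i + j]:
                        best[i + j] = row[j]
                        back[i + j] = (i, word)
            stack.extend((child, row) for child in children)
    # Follow the back pointers; each one marks a visit to "*".
    words, i = [], n
    while i > 0:
        i, word = back[i]
        words.append(word)
    return words[::-1], best[n]


tree = build_lextree(["a", "an", "and", "apple", "bat", "battle", "banana"])
print(segment(tree, "abatandanapple"))
# (['a', 'bat', 'and', 'an', 'apple'], 0)
```

Because only the total edit cost is minimized, many segmentations can tie on noisy input, and this sketch keeps whichever it finds first.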
Note that the procedure above is likely to make many errors. Can you think of any variation to the procedure that may result in better segmentation?
For problem 2, try relative beam widths of 5, 10, and 15. Compare the segmentation and corrected spellings in this file to determine which works best. The "accuracy" of an output is computed as the difference in the number of words in the hypothesized segmentation and the number of words in the correct transcription PLUS the number of misspelled words. If possible, plot accuracy as a function of beam width.
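A small sketch of the two quantities mentioned above follows. Both functions are hypothetical helpers resting on assumptions: "relative beam width" is taken to mean an additive margin above the best cost among the currently active states, "difference" is taken as an absolute difference, and the count of misspelled words is supplied by the caller because the text does not say how it is obtained.

```python
def prune(costs, beam):
    """Keep only states whose cost is within `beam` of the best cost.

    `costs` maps a state identifier to its accumulated path cost.
    """
    best = min(costs.values())
    return {state: c for state, c in costs.items() if c <= best + beam}


def accuracy(hyp_words, ref_words, misspelled):
    """Word-count difference plus the number of misspelled words.

    Lower values mean a better output under this definition.
    """
    return abs(len(hyp_words) - len(ref_words)) + misspelled


print(prune({"x": 3, "y": 7, "z": 9}, beam=5))           # {'x': 3, 'y': 7}
print(accuracy(["a", "bat"], ["a", "bat", "and"], 2))     # 3
```

Running the segmentation once per beam width and collecting `accuracy` for each run gives the points needed for the requested plot.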
Due: Wednesday, 21 Mar 2011.