Fuzzy Text Segmentation / Sequence Matching

This is a nice little side project arising from my thesis work. Say I randomly pick some sentences from a predefined sentence list, then concatenate them to form a long word sequence. The task is to do the reverse: given the word sequence, re-identify those sentences. To make it more interesting (and more realistic), I add a little noise to each sentence. The keyword here is "a little", i.e. the edit distance (Levenshtein) between an original sentence and its copy in the word sequence is small (in most cases 0).
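To pin down what "a little noise" means, here is a minimal Python sketch of word-level Levenshtein distance (a toy illustration of the distance I have in mind, not code from the actual project; the function name edit_distance and the example strings are mine):

    def edit_distance(ref, hyp):
        """Levenshtein distance between two word lists; insertions,
        deletions and substitutions all cost 1."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,               # delete r
                               cur[j - 1] + 1,            # insert h
                               prev[j - 1] + (r != h)))   # substitute r -> h
            prev = cur
        return prev[-1]

    # a "noisy" copy is one a small distance away from some list sentence
    print(edit_distance("finding a proper nursing home".split(),
                        "like finding a proper nursing home".split()))  # -> 1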

Example:

 ... like finding a proper nursing home | in doing it when it got to the point that we had to do it | uh everybody ...

Fuzzy text segmentation is probably not a good name for this problem: fuzzy sequence matching? linear sequence alignment? If you somehow end up here by googling, let me know what keywords you typed in. Note that this is not a sentence boundary detection problem either, since we could be dealing with sentence fragments and who knows what.

We need both to segment the word sequence and to match the segments against known sentences, which makes it a perfect little search project. I wrote a beam search Perl script, and actually got to deal with most of the usual search issues.
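For illustration, a stripped-down Python version of the kind of beam search I mean is sketched below (my own toy re-implementation, not the original Perl script; it reuses edit_distance from the sketch above, and the beam width and slack values are arbitrary choices):

    import heapq

    def segment(words, sentences, beam=20, slack=2):
        """Cover `words` with consecutive segments, each matched to some
        sentence in `sentences`; a hypothesis is scored by the sum of the
        edit distances of its segments."""
        # hypothesis = (cost so far, position reached, [(sentence index, end position), ...])
        hyps = [(0, 0, [])]
        best = None
        while hyps:
            extended = []
            for cost, pos, segs in hyps:
                if pos == len(words):                      # complete hypothesis
                    if best is None or cost < best[0]:
                        best = (cost, segs)
                    continue
                for si, sent in enumerate(sentences):
                    # only try segment lengths close to this sentence's length
                    lo = max(1, len(sent) - slack)
                    hi = min(len(words) - pos, len(sent) + slack)
                    for L in range(lo, hi + 1):
                        c = cost + edit_distance(sent, words[pos:pos + L])
                        extended.append((c, pos + L, segs + [(si, pos + L)]))
            hyps = heapq.nsmallest(beam, extended)         # prune to the beam width
        return best

The real thing of course needs more care (deciding which sentences to even consider at a given position, dealing with leftover words that match nothing, breaking ties between equally good segmentations), but the hypothesis structure and the pruning step are the essence of it.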

The original problem I had was to align two different versions of phonetically transcribed sentence fragments: one with sentence boundary markers, the other without. To get an idea of the scale of the problem, there are more than 5000 sentences and the word sequence has 38k words. My search code burns some major CPU cycles (I blame Perl for that ;-), but gives reasonable results, and it's quite robust. Some sentences are quite short, so a unique segmentation is not guaranteed.

If you are interested, drop me a line so we can discuss.


Yi commented that bioinformatics folks do something similar, called multiple sequence alignment. It seems they are pretty serious about it!
Hua Yu
Last modified: Sun Feb 23 20:20:14 EST 2003