Perhaps the most salient deficiency of conventional language models is their failure to model semantic coherence. These models capture short-distance correlations among words in a sentence fairly well, yet they cannot distinguish meaningful sentences, whose content words come from the same semantic domain, from 'fake' sentences whose content words are drawn at random. As a result, in many language technology applications such as speech recognition, errors that are obvious to a human observer (e.g., a noun replaced by an acoustically similar but semantically different noun) cannot be corrected by the model.
The whole-sentence exponential language model developed recently by our group is naturally suited to modeling whole-sentence phenomena such as semantic coherence. In previous work we have shown that more benefit can be expected from a handful of frequently active features than from many rarely active ones. Ideally, we would like to derive a single computational feature that captures the notion of semantic coherence in a sentence or document.
Building on previous work by Can Cai, we discovered significant differences in the distribution of content words between real text and model-generated ('fake') text. Specifically, for each sentence, all content words were identified. For each pair of words in that set, we estimated a measure of association called Q (similar to a correlation coefficient) based on the appropriate 2x2 contingency table of training-data co-occurrences of the two words. Thus each sentence, true or fake, can be represented by a (variable-length) list of Q values. By defining four features of these lists (their min, max, median and mean), Can showed that the 'true' Q lists have a different distribution than the 'fake' Q lists, and by plugging these very simple features into the exponential model she achieved a performance improvement.
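As a concrete illustration, the computation above can be sketched as follows. This is an assumption-laden sketch: the text does not specify which 2x2 association measure Q is, so the code uses the Yule's Q coefficient, (ad - bc)/(ad + bc), a standard correlation-like statistic for 2x2 contingency tables; the actual measure used may differ.

```python
from statistics import median

def yules_q(a, b, c, d):
    """Yule's Q association from a 2x2 co-occurrence table
    (assumed form of the measure; not confirmed by the text):
    a = sentences containing both words, b = word 1 only,
    c = word 2 only, d = neither. Ranges over [-1, 1]."""
    num = a * d - b * c
    den = a * d + b * c
    return num / den if den else 0.0

def q_list_features(q_list):
    """The four summary features of a sentence's Q list
    described in the text: min, max, median, and mean."""
    return {
        "min": min(q_list),
        "max": max(q_list),
        "median": median(q_list),
        "mean": sum(q_list) / len(q_list),
    }
```

These scalar features can then be plugged into the exponential model as described; counts a, b, c, d would come from co-occurrence statistics over the training data.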
However, we believe that much more improvement is possible and can be realized by learning a more powerful feature from the Q lists. The goal of this project is to automatically learn such a feature from data.
The training data for this task is a very large set of Q lists, each list labeled as 'true' (1) or 'fake' (0). The goal is to find a single function that takes a Q list as input and produces a number between 0 and 1, predicting whether the list came from a 'true' or a 'fake' sentence. One of the main machine learning challenges here is dealing with variable-length inputs.
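One common way to handle variable-length inputs, sketched below, is to pool each list into a fixed-size vector and feed that to a logistic classifier. This is only a baseline illustration, not the project's intended solution; the pooling statistics and the `weights`/`bias` parameters are hypothetical and would be learned from the labeled Q lists.

```python
import math
from statistics import median

def pool(q_list):
    """Map a variable-length Q list to a fixed-length vector
    (an illustrative pooling choice: min, max, median, mean)."""
    return [min(q_list), max(q_list), median(q_list),
            sum(q_list) / len(q_list)]

def predict(q_list, weights, bias):
    """Score in (0, 1): estimated probability that the Q list
    came from a 'true' sentence. weights and bias would be fit
    to the labeled training lists, e.g. by logistic regression."""
    z = bias + sum(w * x for w, x in zip(weights, pool(q_list)))
    return 1.0 / (1.0 + math.exp(-z))
```

A learned feature of this kind could then be added to the whole-sentence exponential model alongside, or in place of, the four hand-built summary features.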