Keselj et al, PACLING 2003
N-gram-based author profiles for authorship attribution
Authors: Vlado Keselj, Fuchun Peng, Nick Cercone, and Calvin Thomas
Paper: [1]
The authors of this paper revisit authorship attribution based on character-level n-gram author profiles. They argue that, if applied correctly, this approach can outperform standard approaches.
The standard approach (as defined by the authors) consists of two parts. First, some technique is used to extract style markers. After the extraction is done, some classification procedure is applied to the resulting description.
Although this approach is commonly used, the authors note some issues with it. For example, the techniques used for style marker extraction are almost always language dependent and typically vary dramatically from language to language. Another major issue they mention is that feature selection is not a trivial process and can involve some guesswork in setting thresholds.
In the paper, the authors propose a method for building a byte-level n-gram profile of an author's writing. Thus they do not use any language-dependent information. As expected, more frequent n-grams are given more weight in the profiles that are generated.
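To make the idea of a byte-level profile concrete, here is a minimal sketch in Python. It is an illustration, not the authors' implementation. It assumes a profile is the set of most frequent byte n-grams with their normalized frequencies, and that two profiles are compared with a relative-difference dissimilarity. That measure is an assumption of this sketch and is not described in the summary above. The names build_profile, dissimilarity, attribute, n, and profile_size are hypothetical.

```python
from collections import Counter
from typing import Dict


def build_profile(text: str, n: int = 3, profile_size: int = 500) -> Dict[bytes, float]:
    """Return the profile_size most frequent byte n-grams with normalized frequencies."""
    # Working on raw bytes means no tokenizer or language-specific rules are needed.
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(profile_size)}


def dissimilarity(p1: Dict[bytes, float], p2: Dict[bytes, float]) -> float:
    """Sum of squared relative frequency differences (assumed measure, see lead-in)."""
    total = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because gram occurs in at least one of the two profiles.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def attribute(text: str, author_texts: Dict[str, str], n: int = 3, profile_size: int = 500) -> str:
    """Return the author whose profile is least dissimilar to the text's profile."""
    query = build_profile(text, n, profile_size)
    profiles = {a: build_profile(t, n, profile_size) for a, t in author_texts.items()}
    return min(profiles, key=lambda a: dissimilarity(query, profiles[a]))


if __name__ == "__main__":
    # Toy training texts, invented purely for illustration.
    samples = {
        "A": "the cat sat on the mat the cat sat",
        "B": "zzzz qqqq zzzz qqqq",
    }
    print(attribute("the cat sat", samples))
```

Because the text is encoded to bytes before n-grams are counted, the same code runs unchanged on English, Greek, or Chinese input. The only parameters to choose are the n-gram length and the profile size.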
The results the authors report are very positive. On a set of 8 different authors, they achieve 100% accuracy on some tests. Their average accuracy is in the 80%-90% range, which is still impressive. They test three languages (English, Greek, and Chinese) and report similar "state-of-the-art" performance on each.