Luyckx et al, CLIN 2004
From ScribbleWiki: Analysis of Social Media
Shallow text analysis and machine learning for authorship analysis
Authors: Kim Luyckx and Walter Daelemans
Paper: [1]
The authors of this paper work with a corpus consisting of newspaper articles about national current affairs by different journalists. With this narrow corpus, many features are kept roughly constant, allowing them to focus on syntax-based and token-based features as predictors of an author's style. The main idea is that these stylistic characteristics are not under the author's conscious control, and are therefore good clues for authorship attribution.
In this paper, authorship attribution is viewed as a text categorization problem. Applications based on specific features of the authors are not explored. The authors classify documents based on four categories of features: token-level features (e.g. word length, syllables, n-grams), syntax-based features (e.g. part-of-speech tags, rewrite rules), features based on vocabulary richness (e.g. type-token ratio, hapax legomena), and common word frequencies. They particularly compare the usefulness of token-level, lexical, and syntax-based features.
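The vocabulary-richness features mentioned above are straightforward to compute. As an illustration (not the authors' exact implementation), a minimal sketch of type-token ratio and hapax legomena counting over a token list:

```python
from collections import Counter

def vocabulary_richness(tokens):
    """Compute the type-token ratio and the number of hapax legomena
    (words occurring exactly once) for a list of tokens."""
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)                    # distinct types / total tokens
    hapax = sum(1 for c in counts.values() if c == 1)  # words seen only once
    return ttr, hapax

tokens = "the cat sat on the mat and the dog sat too".split()
ttr, hapax = vocabulary_richness(tokens)  # 8 types over 11 tokens; 6 hapaxes
```

Both measures are sensitive to text length, which is one reason the authors control document length in their corpus.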
The authors go into a good amount of detail on the specifics of the features they use.
Feature sets used:
- pos: the frequency distribution of parts-of-speech (POS)
- verb B: the frequency distribution of basic verb forms
- verb: the frequency distribution of verb forms
- pat num: the frequency distribution of specific Noun Phrase patterns
- function: the frequency distribution of the forty most frequent function words
- lex: the frequency distribution of the twenty most informative words according to the Rainbow program
- read: the readability score
- all: a combination of all features
- syntax: a combination of all syntax-based features and the token-level feature read
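Several of these feature sets are frequency distributions over a fixed inventory (POS tags, verb forms, function words). A hedged sketch of how such a feature vector might be built from pre-tagged text, using the Dutch tag set from the table below (the toy tagged sentence and function are illustrative, not the authors' code):

```python
from collections import Counter

# Hypothetical pre-tagged input: (token, POS) pairs, as a Dutch tagger might produce
tagged = [("De", "LID"), ("kat", "N"), ("zit", "WW"), ("op", "VZ"),
          ("de", "LID"), ("mat", "N"), (".", "LET")]

TAGSET = ["ADJ", "BW", "LET", "LID", "N", "SPEC", "TSW", "TW", "VG", "VNW", "VZ", "WW"]

def pos_feature_vector(tagged_tokens, tagset=TAGSET):
    """Relative frequency of each POS tag, in a fixed tag order,
    so every document maps to a vector of the same length."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return [counts[t] / total for t in tagset]

vec = pos_feature_vector(tagged)  # e.g. LID occurs 2 of 7 times
```

Fixing the tag order makes vectors from different documents directly comparable, which is what a memory-based learner like TiMBL requires.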
Example list of POS tags in feature set and mean frequency per text in different author classes.
| POS tag | Explanation | A-class | B-class | O-class |
|---------|-------------|---------|---------|---------|
| ADJ | adjectives | 35 | 39 | 41 |
| BW | adverbs | 35 | 30 | 34 |
| LET | punctuation | 79 | 64 | 73 |
| LID | articles | 59 | 63 | 66 |
| N | nouns | 121 | 118 | 137 |
| SPEC | proper nouns | 24 | 23 | 20 |
| TSW | interjections | 0.3 | 0.1 | 0.14 |
| TW | numerals | 8 | 7 | 14 |
| VG | conjunctions | 20 | 18 | 25 |
| VNW | pronouns | 50 | 38 | 48 |
| VZ | prepositions | 66 | 68 | 78 |
| WW | verbs | 81 | 76 | 89 |
The authors used the memory-based learner TiMBL to evaluate the different feature sets, finding that syntax-based features were the best single category for attribution, with a mean F-score of 57.3%. When using all the feature sets in conjunction, they achieved a mean F-score of 72.6%.
| Feature set | A-class | B-class | O-class | Average |
|-------------|---------|---------|---------|---------|
| pos | 43.3% | 54.9% | 44.9% | 47.7% |
| verb B | 53.8% | 43.8% | 27.6% | 41.7% |
| verb | 43.6% | 46.9% | 34.5% | 41.7% |
| pat num | 53.2% | 50.0% | 35.6% | 46.3% |
| function | 65.7% | 55.7% | 43.1% | 54.8% |
| lex | 44.4% | 59.4% | 51.2% | 51.7% |
| read | 62.9% | 53.3% | 36.4% | 50.9% |
| all | 77.6% | 74.7% | 65.5% | 72.6% |
| syntax | 59.4% | 61.7% | 50.9% | 57.3% |
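TiMBL is a memory-based (nearest-neighbour) learner: it stores all training examples and classifies a new document by the labels of its most similar stored examples. A minimal k-NN sketch of that idea over feature vectors (the toy vectors and Euclidean distance are illustrative; TiMBL's actual metrics and weighting differ):

```python
import math

def euclidean(a, b):
    """Distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=1):
    """Memory-based classification: keep all training examples and
    label a query by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy feature vectors (e.g. two POS frequencies) with author-class labels
train = [([0.30, 0.10], "A-class"), ([0.28, 0.12], "A-class"),
         ([0.10, 0.30], "B-class"), ([0.12, 0.28], "B-class")]
pred = knn_predict(train, [0.29, 0.11], k=3)  # nearest examples are A-class
```

Because memory-based learning makes no abstraction over the training data, its performance depends directly on how well the chosen feature sets separate the author classes, which is what the table above measures.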