Luyckx et al, CLIN 2004
From ScribbleWiki: Analysis of Social Media
Shallow text analysis and machine learning for authorship analysis
Authors: Kim Luyckx and Walter Daelemans
Paper: [1]
The authors of this paper work with a corpus consisting of newspaper articles about national current affairs by different journalists. With this narrow corpus, many features are kept roughly constant, allowing them to focus on syntax-based and token-based features as predictors of an author's style. The main idea is that these stylistic characteristics are not under the author's conscious control, and are therefore good clues for authorship attribution.
In this paper, authorship attribution is viewed as a text categorization problem. Applications based on specific features of the authors are not explored. The authors classify documents based on four categories of features: token-level features (e.g. word length, syllables, n-grams), syntax-based features (e.g. part-of-speech tags, rewrite rules), features based on vocabulary richness (e.g. type-token ratio, hapax legomena), and common word frequencies. They particularly compare the usefulness of token-level, lexical, and syntax-based features.
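The vocabulary-richness features mentioned above are straightforward to compute. As an illustration (not the authors' exact implementation), a minimal sketch of type-token ratio and hapax legomena counting over a token list:

```python
from collections import Counter

def vocabulary_richness(tokens):
    """Compute the type-token ratio and the number of hapax legomena
    (words occurring exactly once) for a list of tokens."""
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)                    # distinct types / total tokens
    hapax = sum(1 for c in counts.values() if c == 1)  # words seen only once
    return ttr, hapax

tokens = "the cat sat on the mat and the dog sat too".split()
ttr, hapax = vocabulary_richness(tokens)  # 8 types over 11 tokens; 6 hapaxes
```

Both measures are sensitive to text length, which is one reason the authors control document length in their corpus.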
The authors go into a good amount of detail on the specifics of the features they use.
Feature sets used:
- pos: the frequency distribution of parts-of-speech (POS)
- verb B: the frequency distribution of basic verb forms
- verb: the frequency distribution of verb forms
- pat num: the frequency distribution of specific Noun Phrase patterns
- function: the frequency distribution of the forty most frequent function words
- lex: the frequency distribution of the twenty most informative words according to the Rainbow program
- read: the readability score
- all: a combination of all features
- syntax: a combination of all syntax-based features and the token-level feature read
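Several of these feature sets are frequency distributions over a fixed inventory (POS tags, verb forms, function words). A hedged sketch of how such a feature vector might be built from pre-tagged text, using the Dutch tag set from the table below (the toy tagged sentence and function are illustrative, not the authors' code):

```python
from collections import Counter

# Hypothetical pre-tagged input: (token, POS) pairs, as a Dutch tagger might produce
tagged = [("De", "LID"), ("kat", "N"), ("zit", "WW"), ("op", "VZ"),
          ("de", "LID"), ("mat", "N"), (".", "LET")]

TAGSET = ["ADJ", "BW", "LET", "LID", "N", "SPEC", "TSW", "TW", "VG", "VNW", "VZ", "WW"]

def pos_feature_vector(tagged_tokens, tagset=TAGSET):
    """Relative frequency of each POS tag, in a fixed tag order,
    so every document maps to a vector of the same length."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return [counts[t] / total for t in tagset]

vec = pos_feature_vector(tagged)  # e.g. LID occurs 2 of 7 times
```

Fixing the tag order makes vectors from different documents directly comparable, which is what a memory-based learner like TiMBL requires.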
Example list of POS tags in feature set and mean frequency per text in different author classes.
| POS tag | Explanation | A-class | B-class | O-class |
|---------|-------------|---------|---------|---------|
| ADJ | adjectives | 35 | 39 | 41 |
| BW | adverbs | 35 | 30 | 34 |
| LET | punctuation | 79 | 64 | 73 |
| LID | articles | 59 | 63 | 66 |
| N | nouns | 121 | 118 | 137 |
| SPEC | proper nouns | 24 | 23 | 20 |
| TSW | interjections | 0.3 | 0.1 | 0.14 |
| TW | numerals | 8 | 7 | 14 |
| VG | conjunctions | 20 | 18 | 25 |
| VNW | pronouns | 50 | 38 | 48 |
| VZ | prepositions | 66 | 68 | 78 |
| WW | verbs | 81 | 76 | 89 |
The authors used the memory-based learner TiMBL to evaluate the different feature sets, finding that syntax-based features were the best single category for attribution, with a mean F-score of 57.3%. When using all the feature sets in conjunction, they achieved a mean F-score of 72.6%.
| Feature set | A-class | B-class | O-class | Average |
|-------------|---------|---------|---------|---------|
| pos | 43.3% | 54.9% | 44.9% | 47.7% |
| verb B | 53.8% | 43.8% | 27.6% | 41.7% |
| verb | 43.6% | 46.9% | 34.5% | 41.7% |
| pat num | 53.2% | 50.0% | 35.6% | 46.3% |
| function | 65.7% | 55.7% | 43.1% | 54.8% |
| lex | 44.4% | 59.4% | 51.2% | 51.7% |
| read | 62.9% | 53.3% | 36.4% | 50.9% |
| all | 77.6% | 74.7% | 65.5% | 72.6% |
| syntax | 59.4% | 61.7% | 50.9% | 57.3% |
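TiMBL is a memory-based (nearest-neighbour) learner: it stores all training examples and classifies a new document by the labels of its most similar stored examples. A minimal k-NN sketch of that idea over feature vectors (the toy vectors and Euclidean distance are illustrative; TiMBL's actual metrics and weighting differ):

```python
import math

def euclidean(a, b):
    """Distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=1):
    """Memory-based classification: keep all training examples and
    label a query by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Toy feature vectors (e.g. two POS frequencies) with author-class labels
train = [([0.30, 0.10], "A-class"), ([0.28, 0.12], "A-class"),
         ([0.10, 0.30], "B-class"), ([0.12, 0.28], "B-class")]
pred = knn_predict(train, [0.29, 0.11], k=3)  # nearest examples are A-class
```

Because memory-based learning makes no abstraction over the training data, its performance depends directly on how well the chosen feature sets separate the author classes, which is what the table above measures.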