Lin et al, AIRWeb 2007
From ScribbleWiki: Analysis of Social Media
Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamics
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle L. Tseng
Reviewer: Yimeng
Overview
This paper detects spam blogs (splogs) by exploiting features that distinguish blogs from ordinary web pages. The features used are the temporal and structural regularity of content, post time and links. They are based on the observation that a blog is a dynamic, growing sequence of entries or posts rather than a collection of individual pages. By combining these features with content-based features used in web spam detection (Ntoulas et al, WWW 2006), the authors show that detection performance improves over using the content features alone.
How splogs differ from web spam
1. Dynamic content: Unlike web spam, where content is static, a splog continuously generates fresh content to attract traffic.
2. Non-endorsement links: Hyperlinks in web pages are often interpreted as an endorsement of other pages. Since most blogs have an editable area open for readers to contribute, spammers can easily create links in normal blogs. Therefore, trust propagation works poorly for blogs.
Self similarity
All the features introduced in this paper are based on the self similarity within a blog, that is, the similarity between the blog's own posts. By calculating the similarity between each pair of posts, a self-similarity matrix <math>S</math> is constructed, where <math>S(i,j)</math> is the similarity of posts <math>i</math> and <math>j</math>. To keep the temporal information, posts are ordered by time in the matrix, that is, post <math>i</math> was published earlier than post <math>j</math> if <math>i < j</math>. Three attributes are considered: post time, content and links. For each of them, a similarity matrix is generated.
1. Post time: the timestamp of a post, used to measure the regularity of posting times. Two similarity measures are introduced. One is called micro time, which is the difference between the times of two posts modulo the length of a day. The other is called macro time, which is the difference between the times of two posts.
2. Post content: The similarity measure on content is defined as the histogram intersection on the tf-idf vectors of two posts.
3. Post links: The similarity on links is measured in the same way as for post content, except that the tf-idf vectors are calculated on the counts of target links rather than words.
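The three matrices above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the tf-idf weighting (`log(n/df) + 1`) and the min/max normalisation of the histogram intersection are choices made here for the example, and the sample timestamps and posts are made up.

```python
from collections import Counter
import math

DAY = 24 * 60 * 60  # length of a day in seconds


def macro_time(t_i, t_j):
    """Absolute difference between two post times, in seconds."""
    return abs(t_i - t_j)


def micro_time(t_i, t_j, day=DAY):
    """Time difference modulo one day (0 when posts share the same time of day)."""
    return abs(t_i - t_j) % day


def tfidf_vectors(docs):
    """docs: list of token lists (words, or link targets for the link matrix).

    Returns one {term: weight} dict per document. The idf smoothing used
    here is an assumption for illustration.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def histogram_intersection(a, b):
    """Sum of element-wise minima, normalised by the sum of maxima (assumed)."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def self_similarity(items, sim):
    """Build S with S[i][j] = sim(items[i], items[j]); items must be in time order."""
    return [[sim(a, b) for b in items] for a in items]


# Hypothetical blog with three posts, already sorted by post time.
times = [0, DAY, 2 * DAY + 60]
posts = [["cheap", "pills", "buy"], ["cheap", "pills", "now"], ["holiday", "photos"]]

S_macro = self_similarity(times, macro_time)
S_micro = self_similarity(times, micro_time)
S_content = self_similarity(tfidf_vectors(posts), histogram_intersection)
```

The link matrix would be built the same way as `S_content`, passing each post's list of target URLs instead of its words.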
Self similarity as Temporal features
After calculating the self-similarity matrices, the authors describe how to turn them into features for classification.
1. Regularity features
These features measure the regularity of the self similarity matrix. Two types of patterns are considered.
1) Features along the off-diagonals: the mean, standard deviation and entropy along the <math>k^{th}</math> off-diagonal of the self-similarity matrix are used. Specifically, the expectation along the <math>k^{th}</math> off-diagonal measures the average similarity of a post to another post with k-1 posts in between.
2) Features from coherent blocks: clusters are generated from the self-similarity matrix in order to build blocks. As with the off-diagonal features, the mean, standard deviation and entropy of the blocks are used.
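As a rough sketch of the off-diagonal features, the snippet below extracts the <math>k^{th}</math> off-diagonal of a self-similarity matrix and computes its mean, standard deviation and entropy. The review does not say how the entropy is estimated, so the equal-width binning (4 bins over [0, 1]) is an assumption, as are the function names and the toy matrices. The block-based features are not covered here.

```python
import math


def off_diagonal(S, k):
    """Entries S[i][i+k]: similarity of posts that are k positions apart."""
    return [S[i][i + k] for i in range(len(S) - k)]


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def entropy(xs, bins=4, lo=0.0, hi=1.0):
    """Shannon entropy in bits after equal-width binning (binning is assumed)."""
    counts = [0] * bins
    for x in xs:
        idx = int((x - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    n = len(xs)
    return sum((c / n) * math.log2(n / c) for c in counts if c)


def regularity_features(S, k):
    """Mean, standard deviation and entropy along the k-th off-diagonal."""
    d = off_diagonal(S, k)
    return mean(d), std(d), entropy(d)


# Toy matrices: a very regular blog and a more varied one (made-up values).
S_regular = [
    [1.0, 0.9, 0.9, 0.9],
    [0.9, 1.0, 0.9, 0.9],
    [0.9, 0.9, 1.0, 0.9],
    [0.9, 0.9, 0.9, 1.0],
]
S_varied = [
    [1.0, 0.1, 0.3, 0.2],
    [0.1, 1.0, 0.9, 0.4],
    [0.3, 0.9, 1.0, 0.5],
    [0.2, 0.4, 0.5, 1.0],
]

regular_k1 = regularity_features(S_regular, 1)
varied_k1 = regularity_features(S_varied, 1)
```

In this toy case the regular matrix has zero spread and zero entropy along the first off-diagonal, while the varied one spreads its values over several bins.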
2. Joint features
The intuition behind the joint features is that changes in different attributes (e.g. content, links) usually coincide in normal blogs. Joint features are computed as the joint entropy of two variables, which can be the same <math>k^{th}</math> off-diagonals or the same blocks of two self-similarity matrices of different attributes.
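A joint entropy of two off-diagonals can be sketched as follows. The quantisation into 4 equal-width bins over [0, 1] is an assumption made for this example, and the sample values are invented; the paper may estimate the joint distribution differently.

```python
import math
from collections import Counter


def quantize(x, bins=4, lo=0.0, hi=1.0):
    """Map a similarity value to a bin index (equal-width bins, assumed)."""
    idx = int((x - lo) / (hi - lo) * bins)
    return max(0, min(idx, bins - 1))


def joint_entropy(xs, ys, bins=4):
    """Joint Shannon entropy in bits of two aligned sequences.

    xs and ys are, for example, the same k-th off-diagonal taken from the
    content and link self-similarity matrices.
    """
    pairs = Counter((quantize(x, bins), quantize(y, bins)) for x, y in zip(xs, ys))
    n = sum(pairs.values())
    return sum((c / n) * math.log2(n / c) for c in pairs.values())


# Made-up off-diagonals: content and links change together ...
content_diag = [0.1, 0.9, 0.1, 0.9]
links_coincident = [0.1, 0.9, 0.1, 0.9]
# ... versus links that change independently of content.
links_independent = [0.1, 0.1, 0.9, 0.9]

h_coincident = joint_entropy(content_diag, links_coincident)
h_independent = joint_entropy(content_diag, links_independent)
```

When the two attributes change together, fewer joint states occur and the joint entropy is lower than when they vary independently.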
Experiment
Dataset
The TREC (Text Retrieval Conference) Blog-Track 2006 dataset is used. Blogs and splogs are manually labeled. After labeling, 800 splogs and 800 normal blogs are randomly selected for the experiment.
Detection Performance
Temporal features are combined with content-based features as input to a classifier for detection. A Support Vector Machine with a radial basis function kernel is used. Fisher linear discriminant analysis is used to select features from the content-based features. In the results, base-n means that n features are selected from the content features, and R represents the temporal features.
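The review does not spell out how the Fisher-based selection works. One common reading, assumed here purely for illustration, is to score each feature by the per-feature Fisher criterion (squared difference of class means over the sum of class variances) and keep the n highest-scoring features, which would correspond to a base-n configuration. The function names and the toy data are hypothetical.

```python
def fisher_score(pos, neg):
    """Per-feature Fisher criterion: (mean difference)^2 / (sum of variances)."""
    m1 = sum(pos) / len(pos)
    m2 = sum(neg) / len(neg)
    v1 = sum((x - m1) ** 2 for x in pos) / len(pos)
    v2 = sum((x - m2) ** 2 for x in neg) / len(neg)
    if v1 + v2 == 0:
        # Degenerate case: no within-class spread at all.
        return float("inf") if m1 != m2 else 0.0
    return (m1 - m2) ** 2 / (v1 + v2)


def select_top_n(X, y, n):
    """Return the indices of the n features with the highest Fisher score.

    X is a list of feature rows, y a list of labels (1 = splog, 0 = normal).
    """
    scores = []
    for f in range(len(X[0])):
        pos = [row[f] for row, label in zip(X, y) if label == 1]
        neg = [row[f] for row, label in zip(X, y) if label == 0]
        scores.append((fisher_score(pos, neg), f))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [f for _, f in scores[:n]]


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.9, 0.5], [0.8, 0.4], [0.1, 0.5], [0.2, 0.4]]
y = [1, 1, 0, 0]

base_1 = select_top_n(X, y, 1)
```

The selected content features, together with the temporal features, would then be passed to the classifier.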