Lin et al, ICME 2007
From ScribbleWiki: Analysis of Social Media
Splog Detection using Content, Time and Link Structures
Authors: Yu-Ru Lin, Hari Sundaram, Yun Chi, Jun Tatemura, and Belle Tseng.
Reviewer: Yimeng
Overview and Related Work
This paper aims to detect spam blogs (splogs) by exploiting features unique to blogs: temporal content regularity (self-similarity of content), temporal structural regularity (regular post times), and regularity in linking structure. These features are based on the observation that a blog is a dynamic, growing sequence of entries or posts rather than a collection of individual pages. By combining these features with content-based features used in web spam detection (Ntoulas et al, WWW 2006), the authors show that detection performance improves over using the content features alone.
Difference from Previous Work
This paper is very similar in flavor to Lin et al, AIRWeb 2007: it has the same purpose and the same intuition behind its features, but uses different methods to compute them. The detailed explanation of the features is therefore omitted here; interested readers should refer to Lin et al, AIRWeb 2007.
Temporal regularity features
1. Temporal Content Regularity (TCR)
The TCR value of a blog is estimated by the autocorrelation between posts of that blog. The intuition is that splog posts are highly similar over time, which leads to a high autocorrelation value, whereas human bloggers tend to change topics over time, which results in a low autocorrelation value. Let <math>p(l)</math> be the <math>l^{th}</math> post and <math>p(l+k)</math> the <math>(l+k)^{th}</math> post. The autocorrelation function <math>R(k)</math> for blog posts is calculated as follows.
<math>R(k)=1-d(p(l),p(l+k))</math>
<math>d(p(l),p(l+k))=1-E\left[\frac{\sum^N_i \min(w^l_f(i),w^{l+k}_f(i))}{\sum^N_i \max(w^l_f(i),w^{l+k}_f(i))}\right]</math>
where <math>w^l_f</math> is the tf-idf vector of post <math>l</math> and <math>w^l_f(i)</math> is the weight of the <math>i^{th}</math> word in that vector.
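A minimal sketch of this computation, assuming each post is already represented as a tf-idf vector; the function names `weighted_jaccard` and `autocorrelation` are illustrative, not from the paper:

```python
import numpy as np

def weighted_jaccard(u, v):
    """Weighted Jaccard (min-over-max) similarity between two tf-idf vectors."""
    denom = np.maximum(u, v).sum()
    if denom == 0:
        return 0.0
    return np.minimum(u, v).sum() / denom

def autocorrelation(posts, k):
    """R(k): mean weighted-Jaccard similarity between posts k apart.
    `posts` is a 2-D array with one tf-idf row vector per post, in time order."""
    sims = [weighted_jaccard(posts[l], posts[l + k])
            for l in range(len(posts) - k)]
    return float(np.mean(sims))
```

A splog that reposts near-identical content yields R(k) close to 1 for all lags, while a human blog whose vocabulary drifts over time yields much smaller values.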
2. Temporal Structural Regularity (TSR)
TSR is computed as the entropy of the distribution of post time differences. This distribution is obtained by first clustering the posts based on the interval-difference values; the probability of a cluster is the relative count of posts in that cluster over all posts. The overall distribution is represented by the cluster probabilities, and the entropy is calculated from them. A low entropy indicates machine-like, regular posting; a high entropy indicates irregular, human-like posting.
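The steps above can be sketched as follows; histogram binning stands in for the paper's clustering step, and the function name and `n_bins` parameter are illustrative assumptions:

```python
import math
from collections import Counter

def interval_entropy(post_times, n_bins=5):
    """Entropy of the post-interval distribution. Histogram bins stand in
    for the paper's clustering of interval-difference values."""
    intervals = [t2 - t1 for t1, t2 in zip(post_times, post_times[1:])]
    lo, hi = min(intervals), max(intervals)
    width = (hi - lo) / n_bins or 1.0          # guard against identical intervals
    # Assign each interval to a bin (cluster).
    bins = Counter(min(int((x - lo) / width), n_bins - 1) for x in intervals)
    total = sum(bins.values())
    # Shannon entropy of the cluster probabilities; low entropy = regular posting.
    return -sum((c / total) * math.log2(c / total) for c in bins.values())
```

Perfectly regular posting (all intervals equal) collapses into one cluster and gives entropy 0, the splog-like extreme.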
3. Link regularity estimation
LR measures the consistency of a blogger's outgoing links. The intuition is that a splog exhibits more consistent links, since its main intent is to drive traffic to affiliate websites; its links therefore target affiliate websites rather than normal blogs or websites. The HITS algorithm is used: a blog graph is constructed over all the blogs, hub and authority scores are computed on it, and the hub score of a blog serves as its LR measure.
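A minimal power-iteration sketch of HITS on a blog adjacency matrix; this is an illustrative implementation, not the paper's:

```python
import numpy as np

def hits_hub_scores(adj, iters=50):
    """Power-iteration HITS on a link graph.
    `adj[i][j] = 1` means blog i links to site j; returns hub scores."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    hubs = np.ones(n)
    for _ in range(iters):
        auths = A.T @ hubs              # authority: sum of hub scores linking in
        hubs = A @ auths                # hub: sum of authority scores linked to
        auths /= np.linalg.norm(auths) or 1.0
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs
```

A splog that persistently links to a few heavily-targeted affiliate sites accumulates a high hub score, which is exactly the signal LR captures.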
Experiment
Dataset
The TREC (Text REtrieval Conference) Blog-Track 2006 dataset is used. Blogs and splogs are manually labeled; after labeling, 800 splogs and 800 normal blogs are randomly selected for the experiments.
Detection Performance
Temporal features are combined with content-based features as input to a classifier. A support vector machine with a polynomial kernel is used, and Fisher linear discriminant analysis is used to select among the content-based features. In the reported results, base-n means that n features are selected from the content features, and R denotes the temporal features.
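A sketch of this "base-n + R" setup using scikit-learn; the feature values are synthetic stand-ins and all variable names are hypothetical, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for n selected content-based features and the temporal
# features R = [TCR, TSR, LR] for 40 hypothetical blogs.
n = 40
content = rng.normal(size=(n, 5))
temporal = rng.normal(size=(n, 3))
labels = (temporal[:, 0] + content[:, 0] > 0).astype(int)  # synthetic splog labels

# Concatenate the feature groups and train an SVM with a polynomial kernel.
X = np.hstack([content, temporal])
clf = SVC(kernel="poly", degree=2).fit(X, labels)
train_acc = clf.score(X, labels)
```

The paper's comparison amounts to training this classifier once on the content columns alone (base-n) and once on the concatenated matrix (base-n + R), then comparing detection performance.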