Kolari et al., AAAI 2006
Detecting Spam Blogs: A Machine Learning Approach
The paper applies support vector machines (SVMs) to detecting splogs (a.k.a. spam blogs): fake weblogs with machine-generated or hijacked content, whose sole purpose is to promote affiliated websites or to increase the PageRank of associated sites.
The idea of detecting spam on the web is not new, nor is the idea of training statistical models to separate spam from non-spam. The contribution of the paper is that it formalizes blog spam detection as a machine learning problem, introduces new features aimed specifically at splogs (as opposed to spam web pages in general), and evaluates their importance using SVMs.
The paper introduces two types of features: local and global. A local feature is one that is completely determined by the contents of a single web page. The local features are bag-of-words without stemming (as both binary and TFIDF features), bag-of-word-N-grams (N = 2 and 3), bag-of-anchors (tokenized anchor text), bag-of-urls (tokenized URLs), and a set of 13 language-model-motivated or heuristics-based "specialized features", such as the Location/Person/Organization entity ratios, various compression ratios, and the number of hyphens compared with the number of URLs.
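To make the local features concrete, here is a minimal sketch of three of them: binary bag-of-words, word N-grams, and a compression ratio. The tokenizer, the function names, and the definition of the ratio as compressed size over original size are assumptions made for illustration; the summary above does not say how the paper computes them.

```python
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def binary_bag_of_words(tokens):
    """Map each distinct token to 1, ignoring how often it occurs."""
    return {token: 1 for token in tokens}


def word_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def compression_ratio(text):
    """Compressed size over original size (assumed definition); lower means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


# Hypothetical post text, used only to exercise the functions.
post = "Cheap pills! Cheap pills! Buy cheap pills now."
tokens = tokenize(post)
unigrams = binary_bag_of_words(tokens)
bigrams = word_ngrams(tokens, 2)
ratio = compression_ratio("buy cheap pills " * 50)
```

Highly repetitive, machine-generated text compresses far better than ordinary prose, which is presumably why a compression ratio is a plausible splog signal.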
By contrast, a global (non-local) feature taps into information beyond the content of the web page under test. In the paper, link analysis is used to extract global features that capture relations among blogs, splogs and other web resources, with the intuition that "authentic blogs are very unlikely [to] link to splogs and that splogs frequently do link to other splogs". To do this, the local model is first used to identify a seed set of blogs (those the model is most confident are authentic) and splogs (those the model is most confident are fake), using a cut-off threshold. The remaining blogs are then classified based on their in-link and out-link relationships to these seed blogs and splogs (encoded as 15 link features), again using an SVM.
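The two-stage procedure can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the score convention (higher means more splog-like), the cut-off values, the example URLs, and the four count features are all hypothetical stand-ins for the 15 link features, which this summary does not enumerate.

```python
from collections import defaultdict


def select_seeds(scores, blog_cutoff=-1.0, splog_cutoff=1.0):
    """Keep only pages the local model is confident about (assumed cut-offs)."""
    seeds = {}
    for url, score in scores.items():
        if score >= splog_cutoff:
            seeds[url] = "splog"
        elif score <= blog_cutoff:
            seeds[url] = "blog"
    return seeds


def invert_links(out_links):
    """Build an in-link map from an out-link map."""
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links


def link_features(url, out_links, in_links, seeds):
    """Count links between `url` and the seed blogs and seed splogs."""
    feats = {"out_to_blog": 0, "out_to_splog": 0,
             "in_from_blog": 0, "in_from_splog": 0}
    for dst in out_links.get(url, ()):
        label = seeds.get(dst)
        if label is not None:
            feats["out_to_" + label] += 1
    for src in in_links.get(url, ()):
        label = seeds.get(src)
        if label is not None:
            feats["in_from_" + label] += 1
    return feats


# Hypothetical local-model scores and link graph.
scores = {"a.example": -2.1, "b.example": -1.4, "u.example": 0.2,
          "x.example": 1.8, "y.example": 2.5}
out_links = {
    "a.example": {"b.example"},
    "b.example": {"a.example"},
    "x.example": {"y.example", "u.example"},
    "u.example": {"x.example", "y.example"},
}
seeds = select_seeds(scores)
in_links = invert_links(out_links)
u_feats = link_features("u.example", out_links, in_links, seeds)
```

Here `u.example` falls between the cut-offs, so it is not a seed; its feature vector records two out-links to seed splogs and one in-link from a seed splog, which a second classifier could then use.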
The result is somewhat surprising: the combination of simple binary bag-of-words features with a linear kernel is the most effective, yielding AUC values as high as 0.95. The seemingly more "intelligent" specialized features, when used together, were significantly less effective than the standard features. The global model also underperforms the bag-of-words model, and combining the link features of the seed set with their bag-of-words features did not improve accuracy beyond bag-of-words alone. The authors attribute this to the nature of splogs and conclude that local textual content is arguably the most important discriminating feature for detecting current splogs. However, they also argue that as sploggers' strategies and tactics evolve, some of the more complex features may become useful.
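For readers unfamiliar with the metric, AUC (area under the ROC curve) equals the probability that a randomly chosen splog receives a higher classifier score than a randomly chosen authentic blog. The sketch below computes it directly from that definition; the labels and scores are made-up toy values, not results from the paper.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = splog, 0 = authentic blog (hypothetical scores).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
value = auc(labels, scores)  # 5 of the 6 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so a value near 0.95 means the classifier orders almost every splog/blog pair correctly.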