Scavenger: A junk mail classification program
Rohan V. Malkhare
Master's Thesis, Computer Science and Engineering
Department, University of South Florida, 2004.
Abstract
Junk mail, also called spam, has reached epic proportions,
and various efforts are underway to fight it.
Junk mail classification using machine learning techniques is a key method
to fight spam. We have devised a machine learning algorithm where features
are created from individual sentences in the subject and body of a message
by forming all possible word-pairings from a sentence. Weights are assigned
to the features based on the strength of their predictive capabilities
for spam/legitimate determination. The predictive capabilities are estimated
by the frequency of occurrence of the feature in spam/legitimate collections
as well as by application of heuristic rules. During classification, the total
spam and legitimate evidence in a message is obtained by summing the
weights of the extracted features of each class, and the message is classified
into whichever class accumulates the greater sum. We compared the algorithm
with the popular naïve Bayes algorithm and found that it outperformed
naïve Bayes both in catching spam and in reducing false positives.
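
The pipeline described above (word-pair features per sentence, per-class weights, classification by the larger sum) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the sentence-splitting rule, the frequency-ratio weighting, and the tie-breaking rule are all assumptions made for illustration, and the heuristic rules mentioned in the abstract are omitted because the abstract does not specify them.

```python
import re
from collections import Counter
from itertools import combinations


def pair_features(subject, body):
    """Return the set of unordered word pairs found within single sentences."""
    features = set()
    # Assumption: a sentence ends at '.', '!', '?' or a newline.
    for sentence in re.split(r"[.!?\n]+", f"{subject}\n{body}".lower()):
        words = sorted(set(re.findall(r"[a-z0-9']+", sentence)))
        features.update(combinations(words, 2))
    return features


def train(spam, legit):
    """Map each feature to a (spam_weight, legit_weight) tuple.

    `spam` and `legit` are lists of (subject, body) tuples.
    """
    spam_counts, legit_counts = Counter(), Counter()
    for subject, body in spam:
        spam_counts.update(pair_features(subject, body))
    for subject, body in legit:
        legit_counts.update(pair_features(subject, body))

    weights = {}
    for feature in spam_counts.keys() | legit_counts.keys():
        # Fraction of messages in each collection that contain the feature.
        spam_freq = spam_counts[feature] / max(len(spam), 1)
        legit_freq = legit_counts[feature] / max(len(legit), 1)
        total = spam_freq + legit_freq
        # Assumed weighting: each class gets its share of the total frequency.
        weights[feature] = (spam_freq / total, legit_freq / total)
    return weights


def classify(weights, subject, body):
    """Sum per-class weights of the message's features; the larger sum wins."""
    spam_sum = legit_sum = 0.0
    for feature in pair_features(subject, body):
        spam_weight, legit_weight = weights.get(feature, (0.0, 0.0))
        spam_sum += spam_weight
        legit_sum += legit_weight
    # Assumption: ties go to "legitimate" to limit false positives.
    return "spam" if spam_sum > legit_sum else "legitimate"


weights = train(
    spam=[("Cheap pills", "Buy cheap pills now. Cheap pills online.")],
    legit=[("Lunch", "Meeting notes attached. See you at lunch.")],
)
print(classify(weights, "Offer", "cheap pills for you"))
print(classify(weights, "Re: lunch", "see you at lunch"))
```

Pairs are formed only within a sentence, so words from different sentences never combine. Keeping a separate weight per class, rather than one signed score, mirrors the abstract's description of accumulating spam and legitimate evidence independently.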