Abstract:
Support vector machine (SVM) learning algorithms focus on finding the
hyperplane that maximizes the margin (the distance from the separating
hyperplane to the nearest examples) since this criterion provides a good
upper bound on the generalization error. When applied to text
classification, these learning algorithms lead to SVMs with excellent
precision but poor recall. Various relaxation approaches have been proposed
to counter this problem, including asymmetric SVM learning algorithms (soft
SVMs with asymmetric misclassification costs), uneven-margin-based learning,
and thresholding. A review of these approaches is presented here. In
addition, we describe a new threshold relaxation algorithm.
This approach builds on previous thresholding work based upon the beta-gamma
algorithm. The proposed thresholding strategy is parameter-free, relying on
a process of retrofitting and cross-validation to set algorithm parameters
empirically, whereas our previous approach required the specification of two
parameters (beta and gamma). The proposed approach is more efficient, does
not require the specification of any parameters, and, like the
parameter-based approach, boosts the performance of baseline SVMs by at
least 20% on standard information retrieval measures.
This is joint work by James G. Shanahan and Norbert Roma.
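To illustrate the general idea of threshold relaxation mentioned in the
abstract, here is a minimal sketch. It is not the beta-gamma algorithm or the
new algorithm presented in the talk. It assumes a generic scheme that shifts
the SVM decision threshold away from zero to maximize F1 on held-out data.
The function names and the scores and labels are hypothetical.

```python
def f1_score(labels, predictions):
    """F1 for the positive class (+1); returns 0.0 when there are no true positives."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y != 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p != 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def classify(scores, threshold):
    """Label an example +1 when its decision score reaches the threshold."""
    return [1 if s >= threshold else -1 for s in scores]


def relax_threshold(scores, labels):
    """Pick the decision threshold that maximizes F1 on held-out data.

    Illustrative assumption: candidate thresholds are the observed scores,
    and the default SVM threshold of 0.0 is kept unless a candidate is
    strictly better.
    """
    best_t = 0.0
    best_f1 = f1_score(labels, classify(scores, best_t))
    for t in sorted(set(scores)):
        f1 = f1_score(labels, classify(scores, t))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical held-out decision scores from a trained SVM, with true labels.
# Two positive examples fall just below the default threshold of 0.
scores = [1.2, 0.3, -0.2, -0.4, -0.6, -0.9, -1.1, -1.5]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

baseline_f1 = f1_score(labels, classify(scores, 0.0))
threshold, relaxed_f1 = relax_threshold(scores, labels)
print(baseline_f1, threshold, relaxed_f1)
```

On this toy data the default threshold gives perfect precision but recalls
only half of the positive examples. Lowering the threshold recovers the
positives that sit inside the margin on the negative side.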
Speaker Bio:
Dr. James G. Shanahan is Senior Research Scientist at Clairvoyance
Corporation where he heads the Filtering and Machine Learning Group. At
Clairvoyance Corp, he is actively involved in developing cutting-edge
information management systems that harness information retrieval,
linguistics, text/data mining and machine learning. Prior to joining
Clairvoyance, he was a research scientist at Xerox Research Center Europe
(XRCE), Grenoble, France, where, as a member of the Co-ordination
Technologies Group, he developed and patented new document-centric
approaches to information access (known as Document Souls). Before joining
Xerox Research, he completed his PhD in 1998 at the University of Bristol in
fuzzy probabilistic approaches to machine learning. He has extensive
industrial experience both at the AI group at Mitsubishi in Tokyo, Japan,
and at the satellite-scheduling group of the Iridium project at Motorola,
Phoenix, AZ (over 5 years).
Dr. Shanahan has published three books in the area of fuzzy probabilistic
approaches to machine learning, including a book on knowledge discovery,
"Soft Computing for Knowledge Discovery: Introducing Cartesian Granule
Features". In addition, he has authored over 40 research publications and
has twelve pending patents. He is on the editorial board of the Journal of
Automation and Soft Computing. He has served on the program committees of
numerous international conferences and workshops and is an active journal
reviewer. He is co-organizer of the AAAI Spring Symposium, EAAT, on Affect
and Opinion Modeling (Stanford, 2004). He is a member of IEEE and ACM. His
research interests include Information Management Systems, Text Retrieval
and Filtering, Support Vector Machines, Probabilistic Learning (Expectation
Maximisation, Naïve Bayes, Bayesian Networks, HMMs, Language Modeling),
Clustering, and Uncertainty Modeling.