CMU - IR Discussion Series

Thursday, October 30, 2003 - 12:00, NSH 4632
Boosting Support Vector Machines for Text Classification through Parameter-free Threshold Relaxation
Dr. James G. Shanahan
Slides: pdf ps

Abstract:

Support vector machine (SVM) learning algorithms focus on finding the hyperplane that maximizes the margin (the distance from the separating hyperplane to the nearest examples), since this criterion provides a good upper bound on the generalization error. When applied to text classification, these learning algorithms lead to SVMs with excellent precision but poor recall. Various relaxation approaches have been proposed to counter this problem, including asymmetric SVM learning algorithms (soft SVMs with asymmetric misclassification costs), uneven-margin-based learning, and thresholding. A review of these approaches is presented here. In addition, we describe a new threshold relaxation algorithm that builds on previous thresholding work based on the beta-gamma algorithm. The proposed thresholding strategy is parameter free: it relies on a process of retrofitting and cross validation to set algorithm parameters empirically, whereas our previous approach required the specification of two parameters (beta and gamma). The proposed approach is more efficient and, like the parameter-based approach, boosts the performance of baseline SVMs by at least 20% for standard information retrieval measures.
This is joint work by James G. Shanahan and Norbert Roma.
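
As a rough illustration of the general idea of thresholding mentioned in the abstract, the sketch below moves an SVM's decision threshold away from the default of 0 to trade precision for recall. It is not the beta-gamma algorithm or the parameter-free method described in the talk; it simply scans candidate thresholds and keeps the one with the best F1. The function names, the scores, and the labels are all hypothetical, and in practice the threshold would be chosen on held-out or cross-validated data rather than on the same examples used to report the result.

```python
from typing import List, Tuple


def f1_at_threshold(scores: List[float], labels: List[bool], threshold: float) -> float:
    """F1 when every example scoring at or above `threshold` is predicted positive."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive and not is_positive:
            fp += 1
        elif not predicted_positive and is_positive:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relax_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) for the best-scoring threshold.

    Starts from the default SVM threshold of 0 and tries each observed
    score as a candidate; a candidate replaces the current choice only
    if it strictly improves F1.
    """
    best_threshold = 0.0
    best_f1 = f1_at_threshold(scores, labels, best_threshold)
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Hypothetical SVM decision values (signed distances to the hyperplane)
# and true labels. Two relevant documents fall on the negative side.
scores = [1.4, 0.3, -0.2, -0.4, -0.6, -0.9, -1.3, -2.0]
labels = [True, True, True, False, True, False, False, False]

baseline_f1 = f1_at_threshold(scores, labels, 0.0)
best_threshold, best_f1 = relax_threshold(scores, labels)
print(f"baseline F1 at threshold 0: {baseline_f1:.3f}")
print(f"relaxed threshold {best_threshold}: F1 {best_f1:.3f}")
```

On this toy data the default threshold gives perfect precision but misses half of the relevant documents, while the lowered threshold recovers them at the cost of one false positive.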

Speaker Bio:

Dr. James G. Shanahan is Senior Research Scientist at Clairvoyance Corporation where he heads the Filtering and Machine Learning Group. At Clairvoyance Corp, he is actively involved in developing cutting-edge information management systems that harness information retrieval, linguistics, text/data mining and machine learning. Prior to joining Clairvoyance, he was a research scientist at Xerox Research Center Europe (XRCE), Grenoble, France, where, as a member of the Co-ordination Technologies Group, he developed and patented new document-centric approaches to information access (known as Document Souls). Before joining Xerox Research, he completed his PhD in 1998 at the University of Bristol in fuzzy probabilistic approaches to machine learning. He has extensive industrial experience both at the AI group at Mitsubishi in Tokyo, Japan, and at the satellite-scheduling group of the Iridium project at Motorola, Phoenix, AZ (over 5 years).
Dr. Shanahan has published three books in the area of fuzzy probabilistic approaches to machine learning, including a book on knowledge discovery, "Soft computing for knowledge discovery: Introducing Cartesian granule features". In addition, he has authored over 40 research publications and has twelve pending patents. He is on the editorial board of the Journal of Automation and Soft Computing. He has served on the program committees of numerous international conferences and workshops and is an active journal reviewer. He is co-organizer of the AAAI Spring Symposium, EAAT, on Affect and Opinion Modeling (Stanford, 2004). He is a member of IEEE and ACM.
His research interests include Information Management Systems, Text Retrieval and Filtering, Support Vector Machines, Probabilistic Learning (Expectation Maximisation, Naive Bayes, Bayesian Networks, HMMs, Language Modeling), Clustering, and Uncertainty Modeling.