Efficient Active and Semi-Supervised Learning of Disjunctions
April 17, 2013
In many modern applications, like web-based information gathering,
unlabeled data is abundant but labeling it is expensive. Consequently,
there has been substantial effort in understanding and using
semi-supervised learning (using large amounts of unlabeled data to augment
limited labeled data) and active learning (where the algorithm itself asks
for labels of carefully chosen examples with the goal of minimizing the
human labeling effort). To make inferences from this unlabeled data, one must assume, sometimes implicitly, some connection between the target concept being learned and the underlying distribution, in the form of a regularity assumption.
While good regularity assumptions can information-theoretically reduce the labeled sample complexity, efficient algorithms with provable guarantees are often not known. We give efficient algorithms for learning the class of two-sided disjunctions under modest assumptions, addressing an open problem in semi-supervised learning. In the active learning setting, we remove these assumptions entirely and present an efficient algorithm whose label complexity matches the known lower bound.
Joint work with Nina Balcan, Chris Berlind, and Yingyu Liang.
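To make the concept class concrete: a two-sided disjunction pairs a disjunction over "positive" variables with a disjunction over "negative" variables, and an example is compatible with the hypothesis when exactly one of the two fires. The fraction of unlabeled examples violating this is the kind of regularity measure discussed above. The sketch below is a minimal illustration of that definition, not the algorithm from the talk; the variable partition pos_vars/neg_vars and the helper names are hypothetical.

    import numpy as np

    def two_sided_label(x, pos_vars, neg_vars):
        """Label an example with a two-sided disjunction.

        Returns +1 if only the positive disjunction fires, -1 if only the
        negative one does, and 0 if the example is incompatible (neither
        or both fire).
        """
        pos = bool(x[pos_vars].any())
        neg = bool(x[neg_vars].any())
        if pos and not neg:
            return +1
        if neg and not pos:
            return -1
        return 0  # incompatible with the hypothesis

    def incompatibility(X_unlabeled, pos_vars, neg_vars):
        """Unlabeled error: fraction of examples the hypothesis is
        incompatible with. A regularity assumption asks this quantity
        to be small for the target concept.
        """
        labels = np.array([two_sided_label(x, pos_vars, neg_vars)
                           for x in X_unlabeled])
        return float(np.mean(labels == 0))

    # Illustrative usage: estimate the incompatibility of a candidate
    # variable partition on random boolean data.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 10)).astype(bool)
    print(incompatibility(X, pos_vars=[0, 1, 2], neg_vars=[3, 4]))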