Tuesday, September 11, 2018. 12:00PM. GHC 6115.
Matt Barnes -- Learning with Clusters: A cardinal machine learning sin and how to correct for it
Abstract: As machine learning systems become increasingly complex, clustering has evolved from an exploratory data analysis tool into an integrated component of computer vision, robotics, medical and census data pipelines. Currently, as with many machine learning systems, the output of the clustering algorithm is taken as ground truth at the next pipeline step. We show that this false assumption causes subtle and dangerous behavior in even the simplest systems -- sometimes biasing results by upwards of 25%.
We provide the first empirical and theoretical study of this phenomenon, which we term dependency leakage. Further, we introduce estimators and methods that both quantify and correct for the impact of clustering errors on downstream learners. Our work is agnostic to the downstream learners and requires few assumptions about the clustering algorithm. Empirical results demonstrate that our approach improves these machine learning systems compared to naive approaches, which do not account for clustering errors.
This talk is based on the following two papers:
The Binomial Block Bootstrap Estimator for Evaluating Loss on Dependent Clusters