Introduced about 10 years ago by Bansal, Blum, and Chawla, correlation clustering has become one of the standard techniques in machine learning and data mining. This is due to several advantages that correlation clustering has over other standard clustering methods (e.g., k-means):
-- Correlation clustering requires only qualitative information about similarities between objects. This makes it applicable in scenarios such as crowdsourced duplicate finding, where information about similarities between objects is generated by humans.
-- Correlation clustering does not require the number of clusters to be specified in advance; instead, it produces the number of clusters that best fits the data.
We give new rounding schemes for the standard linear programming relaxation of the correlation clustering problem (recalled after the list below), achieving approximation factors that almost match the integrality gaps:
- For complete graphs our approximation is 2.06 - epsilon for a fixed constant epsilon, which almost matches the previously known integrality gap of 2.
- For complete k-partite graphs our approximation is 3. We also show a matching integrality gap.
- For complete graphs with edge weights satisfying triangle inequalities and probability constraints, our approximation is 1.5, and we show an integrality gap of 1.2.
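For context, the standard LP relaxation being rounded is the following (a standard formulation, stated here for the unweighted complete case rather than quoted from the paper). Each pair of objects u, v gets a variable x_{uv} in [0,1], interpreted as the distance between u and v (0 means "same cluster", 1 means "separated"); E^+ and E^- denote the sets of similar ("+") and dissimilar ("-") pairs:

\begin{align*}
\text{minimize}\quad & \sum_{(u,v)\in E^+} x_{uv} \;+\; \sum_{(u,v)\in E^-} (1 - x_{uv}) \\
\text{subject to}\quad & x_{uw} \le x_{uv} + x_{vw} \quad \text{for all } u, v, w \\
& 0 \le x_{uv} \le 1 \quad \text{for all pairs } u, v
\end{align*}

Integral solutions (all x_{uv} in {0,1}) correspond exactly to clusterings, with cost equal to the number of "+" pairs cut plus the number of "-" pairs left uncut; the integrality gaps quoted above measure the worst-case loss from allowing fractional x.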
Our results improve upon a long line of work on approximation algorithms for correlation clustering in complete graphs, which previously culminated in a 2.5-approximation for the complete case by Ailon, Charikar, and Newman (JACM'08). For the weighted complete case satisfying triangle inequalities and probability constraints, the same authors gave a 2-approximation; for the bipartite case, Ailon, Avigdor-Elgrabli, Liberty, and van Zuylen gave a 4-approximation (SICOMP'12).
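These rounding schemes, like the earlier one of Ailon, Charikar, and Newman, follow a pivot-based template: repeatedly pick a random pivot and decide, independently for each remaining vertex, whether it joins the pivot's cluster based on its LP distance to the pivot. The Python sketch below shows only this generic template under simplifying assumptions; the function names are ours, x is assumed to be a table of LP values keyed by sorted vertex pairs, and the rounding function f is a placeholder rather than the carefully chosen (and, in the new algorithms, sign-dependent) functions that drive the improved ratios.

  import random

  def pivot_round(vertices, x, f):
      # vertices: hashable, comparable labels (e.g. ints)
      # x[(u, v)] with u < v: LP distance between u and v in [0, 1]
      # f: maps an LP distance to a separation probability (placeholder)
      clusters = []
      remaining = list(vertices)
      while remaining:
          pivot = random.choice(remaining)       # uniformly random pivot
          cluster, rest = [pivot], []
          for u in remaining:
              if u == pivot:
                  continue
              key = (min(pivot, u), max(pivot, u))
              if random.random() >= f(x[key]):
                  cluster.append(u)              # u joins the pivot's cluster
              else:
                  rest.append(u)                 # u is deferred to a later round
          clusters.append(cluster)
          remaining = rest                       # recurse on unclustered vertices
      return clusters

Taking f to be the identity (separating u from the pivot with probability exactly x_{pivot,u}) recovers rounding in the spirit of the Ailon-Charikar-Newman algorithm; the improvements come from replacing the identity with more carefully shaped rounding functions.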
Joint work with Shuchi Chawla, Konstantin Makarychev and Tselil Schramm.