ALADDIN Workshop on Graph Partitioning
in Vision and Machine Learning: OPEN PROBLEMS
This page is intended as a repository of open problems or questions
brought up by discussions at the workshop, or problems that workshop
attendees would like to publicize and have others work on. These may
not necessarily be fully-specified problems: in some cases the
"problem" may be to find the right way to ask the question. To submit
a problem, send email to Avrim Blum.
Open Problems: titles
- Graph cuts in space: linear separators in the presence of connected examples.
- Motivated by: word-sense disambiguation; a simpler version of the setting from Tom Mitchell's talk.
- Submitted by: Avrim Blum
- Co-training with linear separators.
- Motivated by: Tom Mitchell's talk, co-training. A more general version of (1).
- Submitted by: Avrim Blum
- Pinned random fields.
- Motivated by: MRF formulations of problems in learning and vision.
- Submitted by: John Lafferty
Open Problems: details
- Graph cuts in space: linear separators in the presence of connected examples.
- Motivated by: word-sense disambiguation; a simpler version of the setting from Tom Mitchell's talk.
- Submitted by: Avrim Blum
The problem:
We are given a collection of points (examples) in n-dimensional space, some
labeled "+", some labeled "-", and some not labeled at all. In
addition, some of the points are connected by edges. The goal is to
find a linear separator with the positive labeled examples on one
side, the negative labeled examples on the other, and that cuts as few
of the edges as possible.
Alternatively, we can think of the input
as just a collection of labeled points, together with some
unlabeled convex bodies, and the goal is to separate the points while
cutting as few of the convex bodies as possible.
Another view of this problem is that we want to find a minimum s-t
cut in a graph, but the vertices of the graph are located at points in
space, and the cut must correspond to a linear separator in this
space.
If this problem is too hard (see below) we could replace "linear
separator" with some other "simple" type of cut (axis-parallel is too
simple), or perhaps have some algorithm that approximates the notion
of "simple" (e.g., finds a quadratic that does nearly as well as the
best linear split).
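To make the objective concrete, here is a minimal sketch in Python (the
data layout and function name are mine, purely for illustration, not
part of the problem statement) that scores a candidate linear separator
h(x) = sign(w.x + b) by checking the labeled points and counting how
many edges it cuts:

import numpy as np

def score_separator(w, b, X, labels, edges):
    # X: (m, n) array of points in n-dimensional space.
    # labels: length-m array with entries +1, -1, or 0 (0 = unlabeled).
    # edges: list of (i, j) index pairs connecting points.
    # Returns (label errors, cut edges); we want zero label errors and
    # as few cut edges as possible.
    side = np.sign(X @ w + b)
    label_errors = int(np.sum((labels != 0) & (side != labels)))
    cut_edges = sum(1 for i, j in edges if side[i] != side[j])
    return label_errors, cut_edges

# Toy instance: two labeled points and two connected unlabeled points.
X = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.2], [1.0, 0.3]])
labels = np.array([+1, -1, 0, 0])
edges = [(2, 3)]
print(score_separator(np.array([0.0, 1.0]), 0.0, X, labels, edges))   # (0, 0)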
Where is this coming from? This is a model of the problem of
co-training where one side is using a linear separator, and the other
side is doing rote-learning or nearest-neighbor learning (which can be
viewed as just connecting examples together from the point of view of
the linear separator). For example, if we think of word-sense
disambiguation, the local context surrounding some word like "bat" can
be viewed as a point in space, and then several different instances of
"bat" in the same document can be viewed as saying we believe all those
points ought to be on the same side of our decision boundary (even if
we don't know which side that is).
What's known? Abie Flaxman showed me a proof that determining
if there is a perfect solution (that cuts no edges) is NP-hard. But
perhaps mileage can be made by adding additional assumptions about the
data. Also, in some sense, the extra information we have (the edges)
should only be helping us in getting closer to the true
target function. So, perhaps one could use a measure of how much this
edge information helps: how much closer it gets us to the target
function than we would be without the connected unlabeled data.
- Co-training with linear separators.
- Motivated by: Tom Mitchell's talk, co-training. A more general version of (1).
- Submitted by: Avrim Blum
The problem: We are given a set S_1 of labeled and unlabeled examples
in n-dimensional space, and another set S_2 of labeled and unlabeled
examples in n-dimensional space. The goal is to find a good linear
separator h_1 for S_1 and another good linear separator h_2 for S_2,
but there is a catch: each unlabeled point x in S_1 is associated with
some unlabeled point f(x) in S_2, and we add a penalty to the
hypothesis pair (h_1, h_2) for each x in S_1 such that
h_1(x) != h_2(f(x)).
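As one concrete reading of this objective, here is a minimal sketch
(names and data layout are mine, for illustration only) that counts the
labeled-example errors of a hypothesis pair plus the disagreement
penalties over the associated unlabeled pairs:

import numpy as np

def cotrain_objective(w1, b1, w2, b2, X1, y1, X2, y2, pairs):
    # X1, y1: points and labels for S_1 (label 0 = unlabeled); likewise X2, y2.
    # pairs: list of (i, j) meaning x_i in S_1 is associated with f(x_i) = x_j in S_2.
    s1 = np.sign(X1 @ w1 + b1)   # h_1 predictions on S_1
    s2 = np.sign(X2 @ w2 + b2)   # h_2 predictions on S_2
    errors = int(np.sum((y1 != 0) & (s1 != y1)))
    errors += int(np.sum((y2 != 0) & (s2 != y2)))
    penalty = sum(1 for i, j in pairs if s1[i] != s2[j])
    return errors + penalty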
Where is this coming from? This is the co-training problem discussed
in Tom Mitchell's talk, where we are explicitly using linear separators
as decision boundaries. In Tom's talk, a "soft" version of this was
considered, which allowed gradient descent to be used for finding a
local optimum of the objective function.
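The specifics of that soft objective are not reproduced here; one
plausible stand-in (my assumption, not taken from the talk) replaces
the 0/1 terms with a logistic loss on the labeled points in each view
plus a squared disagreement between the two sigmoid outputs on
associated pairs, after which a standard optimizer finds a local
optimum:

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_loss(theta, X1, y1, X2, y2, pairs, lam=1.0):
    # theta packs (w1, b1, w2, b2); labels are +1/-1 with 0 = unlabeled.
    n = X1.shape[1]
    w1, b1 = theta[:n], theta[n]
    w2, b2 = theta[n + 1:2 * n + 1], theta[2 * n + 1]
    p1 = sigmoid(X1 @ w1 + b1)   # P(label = +1) under h_1
    p2 = sigmoid(X2 @ w2 + b2)   # P(label = +1) under h_2
    eps = 1e-12
    loss = 0.0
    for p, y in ((p1, y1), (p2, y2)):
        lab = y != 0
        t = (y[lab] + 1) / 2.0   # map {-1, +1} to {0, 1}
        loss += -np.sum(t * np.log(p[lab] + eps) + (1 - t) * np.log(1 - p[lab] + eps))
    loss += lam * sum((p1[i] - p2[j]) ** 2 for i, j in pairs)
    return loss

# Local optimization from an all-zeros start; gradients are estimated numerically.
# theta0 = np.zeros(2 * X1.shape[1] + 2)
# result = minimize(soft_loss, theta0, args=(X1, y1, X2, y2, pairs))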
What's known? This is a harder problem than problem (1) above, because
having f(x) = f(y) is like having examples x and y connected by an
edge, and having f(x) and f(y) have a disjoint set of nonzero
coordinates is like having them not be connected by an edge.
Nonetheless, it would be interesting to be able to give some kind of
performance guarantee, perhaps making some assumptions about the data.
For example, the assumption of "independence given the label" made in
[Blum-Mitchell98] is like an assumption that the function f is a random
matching subject to only connecting points on the same side of the
target function.
- Pinned random fields.
- Motivated by: MRF formulations of problems in learning and vision.
- Submitted by: John Lafferty
The problem: Given a weighted graph, we consider +1/-1 labelings of
the nodes where some fixed subset is "pinned" to +1, and some other
set is "pinned" to -1. The energy E(x) of such a labeling is the
weighted sum, over edges (i, j), of the squared differences
(x_i - x_j)^2, and the random field is given by the Gibbs distribution
p(x) = exp(-beta * E(x))/Z, where the normalizing constant Z is the sum
of exp(-beta * E(x)) over all configurations consistent with the
pinning.
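To fix notation, here is a brute-force sketch (exponential in the
number of free nodes, so only a reference point, with a made-up toy
graph) that computes E(x), Z, and the marginals p(x_i = +1) by
enumerating all configurations consistent with the pins:

import itertools
import numpy as np

def energy(x, edges):
    # E(x) = sum of w_ij * (x_i - x_j)^2 over the weighted edges.
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

def pinned_field(n, edges, pins, beta):
    # pins: dict node -> +1 or -1; the remaining nodes range over {+1, -1}.
    # Returns Z and the marginals p(x_i = +1); feasible only for tiny graphs.
    free = [i for i in range(n) if i not in pins]
    Z = 0.0
    marg = np.zeros(n)
    for assignment in itertools.product([+1, -1], repeat=len(free)):
        x = np.zeros(n)
        for i, s in pins.items():
            x[i] = s
        for i, s in zip(free, assignment):
            x[i] = s
        weight = np.exp(-beta * energy(x, edges))
        Z += weight
        marg += weight * (x == +1)
    return Z, marg / Z

# Toy instance: path 0-1-2-3 with unit weights, endpoints pinned to +1 and -1.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
print(pinned_field(4, edges, pins={0: +1, 3: -1}, beta=1.0))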
Question: Can Z be efficiently approximated? Can the marginal
probability p(x_i) be computed? Possibly in the "zero temperature
limit" beta -> infty?
Where is this coming from? The discrete random field
formulation in vision and machine learning usually computes
MAP solutions, which correspond to st-mincuts. But there
may be lots of these cuts, and they're sensitive to small
changes in the graph. Computing the marginal p(x_i) averages
over many cuts, and would give the Bayes classifier for labeling
a single example rather than the entire collection.
What's known? Apparently not much is known about this problem, even
in the zero temperature limit, which corresponds to averaging over
st-mincuts. The Jerrum-Sinclair sampling scheme for the Ising model
does not appear to extend to this case. The problem is not even known
to be NP-hard.
Note: These problems disappear in the "relaxation" to non-integral
x_i values, corresponding to Gaussian random fields, where the
partition function is computed as a determinant. This approach
has recently been exploited in machine learning.
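For reference, here is a sketch of that computation under the
assumption (consistent with the definition above) that the relaxed
energy is the Laplacian quadratic form E(x) = x^T L x with the pinned
coordinates held fixed: the means of the free coordinates solve a
linear "harmonic" system, and the Gaussian integral over them reduces
to a determinant of the corresponding sub-Laplacian. The graph and pin
values below are illustrative only.

import numpy as np

def gaussian_relaxation(L, pins, beta=1.0):
    # L: (n, n) weighted graph Laplacian (symmetric); pins: dict node -> real value.
    # Returns the mean values on the free nodes and log Z for the Gaussian
    # field p(x) proportional to exp(-beta * x^T L x) with pinned coordinates fixed.
    n = L.shape[0]
    B = sorted(pins)                              # pinned ("boundary") nodes
    U = [i for i in range(n) if i not in pins]    # free nodes
    xB = np.array([pins[i] for i in B], dtype=float)
    L_UU = L[np.ix_(U, U)]
    L_UB = L[np.ix_(U, B)]
    mu = -np.linalg.solve(L_UU, L_UB @ xB)        # harmonic means on the free nodes
    x = np.zeros(n)
    x[B] = xB
    x[U] = mu
    E_min = x @ L @ x                             # energy of the mean configuration
    _, logdet = np.linalg.slogdet(L_UU)
    logZ = -beta * E_min + 0.5 * len(U) * np.log(np.pi / beta) - 0.5 * logdet
    return mu, logZ

# Toy instance: path 0-1-2-3 with unit weights, endpoints pinned to +1 and -1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
print(gaussian_relaxation(L, pins={0: +1.0, 3: -1.0}))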