ALADDIN Workshop on Graph Partitioning
in Vision and Machine Learning: OPEN PROBLEMS
This page is intended as a repository of open problems or questions
brought up by discussions at the workshop, or problems that workshop
attendees would like to publicize and have others work on. These may
not necessarily be fully-specified problems: in some cases the
"problem" may be to find the right way to ask the question. To submit
a problem, send email to Avrim Blum.
Open Problems: titles
- Graph cuts in space: linear separators in the presence of connected examples.
- Motivated by: word-sense disambiguation; a simpler version of the setting from Tom Mitchell's talk.
- Submitted by: Avrim Blum
- Co-training with linear separators.
- Motivated by: Tom Mitchell's talk, co-training. A more general version of (1).
- Submitted by: Avrim Blum
- Pinned random fields.
- Motivated by: MRF formulations of problems in learning and vision.
- Submitted by: John Lafferty
Open Problems: details
- Graph cuts in space: linear separators in the presence of connected examples.
- Motivated by: word-sense disambiguation; a simpler version of the setting from Tom Mitchell's talk.
- Submitted by: Avrim Blum
The problem:
We are given a collection of points (examples) in n-dimensional space, some
labeled "+", some labeled "-", and some not labeled at all. In
addition, some of the points are connected by edges. The goal is to
find a linear separator with the positive labeled examples on one
side, the negative labeled examples on the other, and that cuts as few
of the edges as possible.
Alternatively, we can think of the input
as just a collection of labeled points, together with some
unlabeled convex bodies, and the goal is to separate the points while
cutting as few of the convex bodies as possible.
Another view of this problem is that we want to find a minimum s-t
cut in a graph, but the vertices of the graph are located at points in
space, and the cut must correspond to a linear separator in this
space.
If this problem is too hard (see below) we could replace "linear
separator" with some other "simple" type of cut (axis-parallel is too
simple), or perhaps have some algorithm that approximates the notion
of "simple" (e.g., finds a quadratic that does nearly as well as the
best linear split).
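To make the objective concrete, here is a minimal sketch in Python (the
data layout and function name are mine, purely for illustration, not
part of the problem statement) that scores a candidate linear separator
h(x) = sign(w.x + b) by checking the labeled points and counting how
many edges it cuts:

import numpy as np

def score_separator(w, b, X, labels, edges):
    # X: (m, n) array of points in n-dimensional space.
    # labels: length-m array with entries +1, -1, or 0 (0 = unlabeled).
    # edges: list of (i, j) index pairs connecting points.
    # Returns (label errors, cut edges); we want zero label errors and
    # as few cut edges as possible.
    side = np.sign(X @ w + b)
    label_errors = int(np.sum((labels != 0) & (side != labels)))
    cut_edges = sum(1 for i, j in edges if side[i] != side[j])
    return label_errors, cut_edges

# Toy instance: two labeled points and two connected unlabeled points.
X = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.2], [1.0, 0.3]])
labels = np.array([+1, -1, 0, 0])
edges = [(2, 3)]
print(score_separator(np.array([0.0, 1.0]), 0.0, X, labels, edges))   # (0, 0)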
Where is this coming from? This is a model of the problem of
co-training where one side is using a linear separator, and the other
side is doing rote-learning or nearest-neighbor learning (which can be
viewed as just connecting examples together from the point of view of
the linear separator). For example, if we think of word-sense
disambiguation, the local context surrounding some word like "bat" can
be viewed as a point in space, and then several different instances of
"bat" in the same document can be viewed as saying we believe all those
points ought to be on the same side of our decision boundary (even if
we don't know which side that is).
What's known? Abie Flaxman showed me a proof that determining
if there is a perfect solution (that cuts no edges) is NP-hard. But
perhaps mileage can be made by adding additional assumptions about the
data. Also, in some sense, the extra information we have (the edges)
should only be helping us in getting closer to the true
target function. So, perhaps one could use a measure of how much this
edge information helps: how much closer it gets us to the target
function than we would be without the connected unlabeled data.
- Co-training with linear separators.
- Motivated by: Tom Mitchell's talk, co-training. A more general version of (1).
- Submitted by: Avrim Blum
The problem: We are given a set S_1 of labeled and unlabeled examples
in n-dimensional space, and another set S_2 of labeled and unlabeled
examples in n-dimensional space. The goal is to find a good linear
separator h_1 for S_1 and another good linear separator h_2 for S_2,
but there is a catch: each unlabeled point x in S_1 is associated with
some unlabeled point f(x) in S_2, and we add a penalty to the
hypothesis pair (h_1, h_2) for each x in S_1 such that
h_1(x) != h_2(f(x)).
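As one concrete reading of this objective, here is a minimal sketch
(names and data layout are mine, for illustration only) that counts the
labeled-example errors of a hypothesis pair plus the disagreement
penalties over the associated unlabeled pairs:

import numpy as np

def cotrain_objective(w1, b1, w2, b2, X1, y1, X2, y2, pairs):
    # X1, y1: points and labels for S_1 (label 0 = unlabeled); likewise X2, y2.
    # pairs: list of (i, j) meaning x_i in S_1 is associated with f(x_i) = x_j in S_2.
    s1 = np.sign(X1 @ w1 + b1)   # h_1 predictions on S_1
    s2 = np.sign(X2 @ w2 + b2)   # h_2 predictions on S_2
    errors = int(np.sum((y1 != 0) & (s1 != y1)))
    errors += int(np.sum((y2 != 0) & (s2 != y2)))
    penalty = sum(1 for i, j in pairs if s1[i] != s2[j])
    return errors + penalty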
Where is this coming from? This is the co-training problem discussed
in Tom Mitchell's talk, where we are explicitly using linear separators
as decision boundaries. In Tom's talk, a "soft" version of this was
considered, which allowed gradient descent to be used for finding a
local optimum of the objective function.
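The specifics of that soft objective are not reproduced here; one
plausible stand-in (my assumption, not taken from the talk) replaces
the 0/1 terms with a logistic loss on the labeled points in each view
plus a squared disagreement between the two sigmoid outputs on
associated pairs, after which a standard optimizer finds a local
optimum:

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_loss(theta, X1, y1, X2, y2, pairs, lam=1.0):
    # theta packs (w1, b1, w2, b2); labels are +1/-1 with 0 = unlabeled.
    n = X1.shape[1]
    w1, b1 = theta[:n], theta[n]
    w2, b2 = theta[n + 1:2 * n + 1], theta[2 * n + 1]
    p1 = sigmoid(X1 @ w1 + b1)   # P(label = +1) under h_1
    p2 = sigmoid(X2 @ w2 + b2)   # P(label = +1) under h_2
    eps = 1e-12
    loss = 0.0
    for p, y in ((p1, y1), (p2, y2)):
        lab = y != 0
        t = (y[lab] + 1) / 2.0   # map {-1, +1} to {0, 1}
        loss += -np.sum(t * np.log(p[lab] + eps) + (1 - t) * np.log(1 - p[lab] + eps))
    loss += lam * sum((p1[i] - p2[j]) ** 2 for i, j in pairs)
    return loss

# Local optimization from an all-zeros start; gradients are estimated numerically.
# theta0 = np.zeros(2 * X1.shape[1] + 2)
# result = minimize(soft_loss, theta0, args=(X1, y1, X2, y2, pairs))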
What's known? This is a harder problem than problem (1) above, because
having f(x) = f(y) is like having examples x and y connected by an
edge, and having f(x) and f(y) have a disjoint set of nonzero
coordinates is like having them not be connected by an edge.
Nonetheless, it would be interesting to be able to give some kind of
performance guarantee, perhaps making some assumptions about the data.
For example, the assumption of "independence given the label" made in
[Blum-Mitchell98] is like an assumption that the function f is a random
matching subject to only connecting points on the same side of the
target function.
- Pinned random fields.
- Motivated by: MRF formulations of problems in learning and vision.
- Submitted by: John Lafferty
The problem: Given a weighted graph, we consider +1/-1 labelings of
the nodes where some fixed subset is "pinned" to +1, and some other
set is "pinned" to -1. The energy E(x) of such a labeling is the
weighted sum, over edges (i, j), of the squared differences
(x_i - x_j)^2, and the random field is given by the Gibbs distribution
p(x) = exp(-beta * E(x))/Z, where the normalizing constant Z is the sum
of exp(-beta * E(x)) over all configurations consistent with the
pinning.
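To fix notation, here is a brute-force sketch (exponential in the
number of free nodes, so only a reference point, with a made-up toy
graph) that computes E(x), Z, and the marginals p(x_i = +1) by
enumerating all configurations consistent with the pins:

import itertools
import numpy as np

def energy(x, edges):
    # E(x) = sum of w_ij * (x_i - x_j)^2 over the weighted edges.
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

def pinned_field(n, edges, pins, beta):
    # pins: dict node -> +1 or -1; the remaining nodes range over {+1, -1}.
    # Returns Z and the marginals p(x_i = +1); feasible only for tiny graphs.
    free = [i for i in range(n) if i not in pins]
    Z = 0.0
    marg = np.zeros(n)
    for assignment in itertools.product([+1, -1], repeat=len(free)):
        x = np.zeros(n)
        for i, s in pins.items():
            x[i] = s
        for i, s in zip(free, assignment):
            x[i] = s
        weight = np.exp(-beta * energy(x, edges))
        Z += weight
        marg += weight * (x == +1)
    return Z, marg / Z

# Toy instance: path 0-1-2-3 with unit weights, endpoints pinned to +1 and -1.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
print(pinned_field(4, edges, pins={0: +1, 3: -1}, beta=1.0))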
Question: Can Z be efficiently approximated? Can the marginal
probability p(x_i) be computed? Possibly in the "zero temperature
limit" beta -> infty?
Where is this coming from? The discrete random field
formulation in vision and machine learning usually computes
MAP solutions, which correspond to st-mincuts. But there
may be lots of these cuts, and they're sensitive to small
changes in the graph. Computing the marginal p(x_i) averages
over many cuts, and would give the Bayes classifier for labeling
a single example rather than the entire collection.
What's known? Apparently not much is known about this problem, even
in the zero temperature limit, which corresponds to averaging over
st-mincuts. The Jerrum-Sinclair sampling scheme for the Ising model
does not appear to extend to this case. The problem is not even known
to be NP-hard.
Note: These problems disappear in the "relaxation" to non-integral
x_i values, corresponding to Gaussian random fields, where the
partition function is computed as a determinant. This approach
has recently been exploited in machine learning.
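For reference, here is a sketch of that computation under the
assumption (consistent with the definition above) that the relaxed
energy is the Laplacian quadratic form E(x) = x^T L x with the pinned
coordinates held fixed: the means of the free coordinates solve a
linear "harmonic" system, and the Gaussian integral over them reduces
to a determinant of the corresponding sub-Laplacian. The graph and pin
values below are illustrative only.

import numpy as np

def gaussian_relaxation(L, pins, beta=1.0):
    # L: (n, n) weighted graph Laplacian (symmetric); pins: dict node -> real value.
    # Returns the mean values on the free nodes and log Z for the Gaussian
    # field p(x) proportional to exp(-beta * x^T L x) with pinned coordinates fixed.
    n = L.shape[0]
    B = sorted(pins)                              # pinned ("boundary") nodes
    U = [i for i in range(n) if i not in pins]    # free nodes
    xB = np.array([pins[i] for i in B], dtype=float)
    L_UU = L[np.ix_(U, U)]
    L_UB = L[np.ix_(U, B)]
    mu = -np.linalg.solve(L_UU, L_UB @ xB)        # harmonic means on the free nodes
    x = np.zeros(n)
    x[B] = xB
    x[U] = mu
    E_min = x @ L @ x                             # energy of the mean configuration
    _, logdet = np.linalg.slogdet(L_UU)
    logZ = -beta * E_min + 0.5 * len(U) * np.log(np.pi / beta) - 0.5 * logdet
    return mu, logZ

# Toy instance: path 0-1-2-3 with unit weights, endpoints pinned to +1 and -1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
print(gaussian_relaxation(L, pins={0: +1.0, 3: -1.0}))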