Problems and goals: High-dimensional data is one of the challenges
that machine learning researchers increasingly face. This is a
consequence both of the growing volume of available data collections
and of the spread of machine learning techniques into ever wider
application areas. Domains with typically high data dimensionality
include pattern recognition and image processing, text and language
modeling, diagnosis systems, computational biology, and genetics.
Constructing models from data in high-dimensional domains raises
problems that are nonexistent, or less severe, in lower-dimensional
cases. Data, models, and error surfaces in more than three dimensions
are hard to visualize and to represent intuitively. If the variables
are discrete, the size of the state space grows exponentially with the
number of dimensions and may exceed the number of available samples by
many orders of magnitude. As a consequence, overfitting avoidance and
feature selection become critical. Moreover, an increased
dimensionality of the parameter space (in the case of parametric
models) may lead to an exponential increase in the computational cost
of finding the optimal set of parameters. Special attention must
therefore be paid to algorithmic issues: the need for models that can
be learned and used efficiently is stringent for high-dimensional data
sets.
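The exponential growth of the discrete state space can be made concrete with a short sketch (the sample size and variable counts below are illustrative assumptions, not figures from the workshop):

```python
# Illustrative sketch: for d discrete variables of a given arity, the
# joint state space has arity**d configurations, so even a large sample
# covers a vanishing fraction of it as d grows.

def state_space_size(d, arity=2):
    """Number of joint configurations of d discrete variables."""
    return arity ** d

n_samples = 100_000  # a hypothetical dataset size
for d in (10, 30, 100):
    states = state_space_size(d)
    print(f"d={d:3d}: {states:.2e} states, "
          f"{n_samples / states:.2e} samples per state on average")
```

Already at d = 30 binary variables there are over a billion joint configurations, so most states are never observed in any realistic sample.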
Of the wide existing spectrum of models and machine learning methods,
only a few are currently applied to high-dimensional problems, and
many of these are among the simplest (naive Bayes, nearest
neighbor). The reasons lie both in computational difficulties and in
local minima and model selection problems. The goal of this workshop
is to better understand the sources of difficulty in training (and
using) high-dimensional statistical models and to expose the
participants to recent solutions and to approaches outside the
traditional scope of NIPS (e.g. multiscale models). We will focus on
the algorithmic side of the problem, with emphasis on:
- fast and very fast exact algorithms
- models/training algorithms with no local minima (e.g. support vectors, tree distributions)
- data structures
- approximate and domain-specific algorithms
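As one example of the kind of data structure meant above, a kd-tree supports fast exact nearest-neighbor queries; the following is a minimal sketch under naming of our own, not a reference implementation from the workshop:

```python
import math

# Minimal kd-tree sketch: recursive median split on cycling axes,
# with branch-and-bound pruning during nearest-neighbor search.

def build(points, depth=0):
    """Build a kd-tree (as nested dicts) from a list of point tuples."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Exact nearest neighbor of query among the tree's points."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(query, point) < math.dist(query, best):
        best = point
    diff = query[axis] - point[axis]
    close, away = ((node["left"], node["right"]) if diff <= 0
                   else (node["right"], node["left"]))
    best = nearest(close, query, best)
    # Visit the far subtree only if the splitting plane is closer
    # than the best distance found so far.
    if abs(diff) < math.dist(query, best):
        best = nearest(away, query, best)
    return best
```

Note that the pruning step loses its effectiveness as dimensionality grows, which is itself one of the difficulties this workshop addresses.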
We hope that the workshop will enable the application of machine
learning techniques to a larger class of significant real-world
problems.
The workshop will bring together researchers from domains dealing with
constructing statistical models of high-dimensional data, from data
mining, graphical models, and algorithms, to discuss issues that are
relevant across different fields. It aims to make the NIPS community
aware of domain-specific approaches to this problem and of the typical
assumptions that allow learning in various fields of application. In
addition, the workshop will help other communities understand how
learning and the NIPS community can be useful in solving their
problems.
A key goal of the workshop will be to expose researchers to ideas and
open problems like:
- When "quadratic" is not good enough: very fast algorithms for large problems.
- How to use prior knowledge to speed up training and search: lessons from domain-specific paradigms.
- How to efficiently prune irrelevant features: implicit versus explicit feature selection techniques.
- Efficient computation of sufficient statistics. It is known that,
in learning the structure of a graphical model, a large number of
statistics (cooccurrence counts) must be evaluated, and that this is
one of the most computationally intensive stages of the search over
models. What techniques are available for computing and storing the
sufficient statistics efficiently, approximating them, or predicting
their values from other statistics?
- Approximate belief net propagation methods that scale well
- Supervised versus unsupervised training. One often finds that
density estimators perform well in classification/recognition
tasks. What causes this behavior? Are there lessons to be learned
that would improve classifier training?
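The cooccurrence counts mentioned above can be gathered in a single pass over the data; the sketch below uses names and a toy dataset of our own choosing (assumptions for illustration, not material from the workshop):

```python
from collections import Counter
from itertools import combinations

# Illustrative sketch: pairwise sufficient statistics for discrete
# data, i.e. joint value counts for every pair of variables.

def pairwise_counts(data):
    """Count joint value occurrences for every pair of variables.

    data: list of tuples, one tuple of discrete values per sample.
    Returns {(i, j): Counter mapping (value_i, value_j) -> count}.
    """
    n_vars = len(data[0])
    counts = {pair: Counter() for pair in combinations(range(n_vars), 2)}
    for sample in data:
        for (i, j), c in counts.items():
            c[(sample[i], sample[j])] += 1
    return counts

# Usage on a toy binary dataset of 4 samples over 3 variables:
data = [(0, 1, 1), (1, 1, 0), (0, 1, 1), (1, 0, 0)]
stats = pairwise_counts(data)
print(stats[(0, 1)])  # joint counts for variables 0 and 1
```

This naive version touches every pair for every sample; the workshop questions above concern exactly how to do better, e.g. by approximating or caching such counts.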
Format: This will be a one-day workshop, interspersing short invited
talks (20 min) with moderated discussions. The speakers will be
encouraged to address challenges and controversial topics both in
their prepared talks and in the ensuing discussions. The choice of
topics will be balanced between problems/challenges and presentations
of algorithmic solutions. For the latter, the presentation of new
approaches or work in progress will be especially encouraged; the
former may include tutorial material if it refers to fields outside
the scope of the majority of attendees. To maximize the benefit for
all participants, the focus will be on algorithmic issues that are
general or common to several fields and on identifying solutions with
the potential to generalize. Since one of the goals of the workshop is
to facilitate communication between researchers in different
subfields, ample time will be given to questions. The last part of the
workshop will be devoted to a discussion of the most promising
approaches and ideas that emerge during the workshop.
Contact Info
Marina Meila
Smith Hall 208
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh PA 15213
mmp@cs.cmu.edu
Phone:(412)268-8424
Fax:(412)268-5571
Andrew W. Moore
Smith Hall 221
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh PA 15213
awm@cs.cmu.edu
Phone:(412)268-7599
Fax:(412)268-5571