Datasets and project suggestions: Below are descriptions of several data sets, and some suggested projects. The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects, or to make up your own well-specified project using these datasets. Working on an alternative dataset must be approved by the instructors.
Summary
time series: A: fMRI, F: bodymedia, G: NBA
relational data: B: IMDB, C: DBLP, E: credit card
text data: D: Newsgroups, H: WebKB
A: fMRI brain imaging data
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this
time, human subjects performed 40 trials of a sentence-picture comparison
task (reading a sentence, observing a picture, and determining whether the
sentence correctly described the picture). Each of the 40 trials lasts
approximately 30 seconds. Each image contains approximately 5,000 voxels (3D
pixels), across a large portion of the brain. Data is available for 12
different human subjects.
Available software: we can provide
Matlab software for reading the data, manipulating and visualizing it, and
for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A1: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to use as classifier input (5,000 voxels times 8 seconds of images is a lot of features), whether to train subject-specific or subject-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Midpoint milestone: By Nov 8 you should have run at least one classification algorithm on this data and measured its accuracy using a cross-validation test. This will put you in a good position to explore refinements of the algorithm, alternative feature encodings for the data, or competing algorithms by the end of the semester.
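For concreteness, here is a minimal sketch of this milestone in Python with scikit-learn (the provided Matlab software is an equally valid route). The arrays X and y below are random placeholders for the real voxel windows and trial labels:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))   # placeholder: one row per 8-second window of voxel activities
y = rng.integers(0, 2, size=80)   # placeholder: 0 = picture trial, 1 = sentence trial

scores = cross_val_score(GaussianNB(), X, y, cv=5)  # 5-fold cross-validation
print("mean cross-validation accuracy: %.3f" % scores.mean())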
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
Project A2: Dimensionality reduction for fMRI data
Project idea: Explore the use of dimensionality-reduction methods to improve classification accuracy with this data. Given the extremely high dimension of the input to the classifier (5,000 voxels times 8 images), it is sensible to explore methods for reducing this to a small number of dimensions. For example, consider PCA, hidden layers of neural nets, or other relevant dimensionality-reduction methods. PCA is an example of a method that finds lower-dimensional representations that minimize error in reconstructing the data. In contrast, neural network hidden layers are lower-dimensional representations of the inputs that minimize classification error (but only find a local minimum). Does one of these work better? Does it depend on parameters such as the number of training examples?
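A sketch of the PCA-versus-raw-features comparison, again with random placeholder data standing in for the real fMRI windows:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))   # placeholder voxel windows
y = rng.integers(0, 2, size=80)   # placeholder trial labels

raw = LogisticRegression(max_iter=1000)
reduced = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))

print("raw features:  %.3f" % cross_val_score(raw, X, y, cv=5).mean())
print("PCA(20) first: %.3f" % cross_val_score(reduced, X, y, cv=5).mean())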
Papers to read: "Learning
to Decode Cognitive States from Brain Images," Mitchell et al., 2004,
papers and textbook on PCA, neural nets, or whatever you propose to try.
Project A3: Feature selection/feature invention for fMRI classification
Project idea: As in many high-dimensional data sets, automatic selection of a subset of features can have a strong positive impact on classifier accuracy. We have found that selecting features by the difference in their activity when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy [Mitchell et al., 2004]. In this project you could suggest, implement, and test alternative feature selection strategies (e.g., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature independently of the others), and see whether you can obtain higher classification accuracies. Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels whose time series are highly correlated).
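A sketch of the "incremental value" strategy as greedy forward selection scored by cross-validation, instead of ranking each voxel independently; the candidate features here are random placeholders:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))    # placeholder: 200 candidate voxel features
y = rng.integers(0, 2, size=80)

selected = []
for _ in range(5):                # grow the feature set five steps
    best_j, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        score = cross_val_score(GaussianNB(), X[:, selected + [j]], y, cv=5).mean()
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)
    print("added voxel %d, accuracy %.3f" % (best_j, best_score))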
Papers to read: "Learning
to Decode Cognitive States from Brain Images," Mitchell et al., 2004,
papers on feature selection
B: IMDB data
This data set consists of two main parts:
IMDB data: a movie database consisting of many different attributes of movies, e.g., title, genre, actors/actresses, directors, company, year, etc.
User rating data: user ratings on different movies
Project ideas:
Project B1: Classification: predict a user's rating from the movie information. This involves feature selection and classification.
Project B2: Clustering evolution: much of the research on social networks tries to identify community structure in the social relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how the clusters change over time; a minimal sketch of this idea follows below.
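One way to prototype the cluster-evolution idea: build an interaction graph per time window, cluster each, and measure how stable the clusters are across windows. The random adjacency matrices below stand in for real user/actor interaction data:

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 60
labels_by_window = []
for window in range(3):           # three hypothetical time windows
    A = rng.random((n, n))
    A = ((A + A.T) / 2 > 0.7).astype(float)   # random symmetric interaction graph
    np.fill_diagonal(A, 0)
    labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                                random_state=0).fit_predict(A)
    labels_by_window.append(labels)

for t in range(2):                # how stable are communities window to window?
    ari = adjusted_rand_score(labels_by_window[t], labels_by_window[t + 1])
    print("adjusted Rand index, window %d vs %d: %.3f" % (t, t + 1, ari))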
C: DBLP bibliography data
This data set consists of computer science bibliography data. The data is well organized and can be downloaded at http://dblp.uni-trier.de/xml/. An interesting browser for viewing this dataset is also available.
Project ideas:
Project C1: Clustering evolution: as in Project B2, cluster authors based on their interactions (e.g., co-authorship) and model how the clusters change over time.
Project C2: Distance measure study: a good distance function is crucial for the success of any learning algorithm. This is especially true for a heterogeneous dataset like this one, where a naive distance function such as Euclidean distance is not even well defined.
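A sketch of a hand-built distance for heterogeneous bibliography records, combining per-field distances (set overlap for coauthors, absolute difference for years) into one score. The record format and field weights are illustrative assumptions, not the actual DBLP schema:

def paper_distance(p, q, w_authors=0.7, w_year=0.3):
    """Distance in [0, 1] between two papers represented as dicts."""
    a, b = set(p["authors"]), set(q["authors"])
    jaccard = 1.0 - len(a & b) / len(a | b)                  # coauthor dissimilarity
    year_gap = min(abs(p["year"] - q["year"]) / 50.0, 1.0)   # capped year gap
    return w_authors * jaccard + w_year * year_gap

p = {"authors": ["A. Smith", "B. Jones"], "year": 1999}
q = {"authors": ["B. Jones", "C. Lee"], "year": 2003}
print("distance: %.3f" % paper_distance(p, q))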
D: 20 Newsgroups text data
This data set contains 1,000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is the newsgroup it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website
provides an implementation of a Naive Bayes classifier for this text data.
The code is quite robust, and some documentation is available, but it is
difficult code to modify.
Project ideas:
Project D1: EM for text classification in the case where you have labels for some documents but not for others (see McCallum et al., and come up with your own suggestions); a minimal sketch appears after this list.
Project D2: Make up your own text learning problem/approach.
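A sketch of the semi-supervised EM idea in the style of McCallum and colleagues: fit Naive Bayes on the labeled documents, then alternate between labeling the unlabeled documents (E-step) and refitting (M-step). Toy bag-of-words counts stand in for the real newsgroup data, and hard labels stand in for the usual soft posteriors:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_lab = rng.poisson(1.0, size=(40, 300))    # labeled doc-word counts (toy)
y_lab = rng.integers(0, 2, size=40)
X_unl = rng.poisson(1.0, size=(400, 300))   # unlabeled doc-word counts (toy)

nb = MultinomialNB().fit(X_lab, y_lab)
for step in range(5):
    y_unl = nb.predict(X_unl)               # E-step: hard labels (a soft version
                                            # would weight by predict_proba)
    X_all = np.vstack([X_lab, X_unl])
    y_all = np.concatenate([y_lab, y_unl])
    nb = MultinomialNB().fit(X_all, y_all)  # M-step: refit on all documents
print("class priors after EM:", np.exp(nb.class_log_prior_))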
E: Credit card data
This dataset comes from a recent real data mining competition (http://mill.ucsd.edu/).
When a company is
evaluating whether an individual is a 'good' or 'bad' customer, it uses
historical information from that customer's account. For example, a credit
card company might be interested in identifying customers that are likely to
go bankrupt. The company will use past transaction information to predict
future bankruptcy. Once a potentially 'bad' account is predicted, the
company will take additional steps to verify the actual nature of the
account.
Our goal is to develop a system that uses historical data and accurately
predicts which accounts are likely to be 'bad.'
There is a cost for the company if we inaccurately
predict an account to be a 'good' account. For example, the credit
card company will have to pay for a customer's bankruptcy. There is
also a cost if we inaccurately predict an account to be 'bad'. In
our example, the company might launch a costly investigation or
prematurely cut off a good customer's account. Also, we are
interested in detecting 'bad' behavior as soon as possible. For
example, suppose a customer unknowingly has her identity stolen. We
want to take action, such as calling the customer, as soon as
possible.
For each customer, there is a time series of between
1 and 10,000 records.
Each record contains 41 pieces of information:
The first value is the account id.
The second value is the record id. These are consecutive in time, but not sampled at regular intervals.
Values 3 through 41 are data (boolean, real, integer) associated with each record.
The training data also has a 42nd value, which is the record label.
You don't need any specific domain knowledge about
values 3 through 41 to solve the problem. This is where machine
learning is useful.
Each record has a binary record label. This label is '1' for
'bad' and '0' for 'good'. The first record of an account can either
start out labeled as 'good' or 'bad', but once there is a 'bad'
record, all the following records for the account will also have a
'bad' label. A 'bad' account is one which has at least one record
with a 'bad' label.
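A sketch of loading the training records and deriving the account-level targets for the two tasks described next; the file name and comma-separated layout are assumptions about the download format:

import csv
from collections import defaultdict

records = defaultdict(list)                 # account id -> list of records
with open("training.csv") as f:             # hypothetical file name
    for row in csv.reader(f):
        account_id = row[0]
        record_id = int(row[1])
        features = [float(v) for v in row[2:41]]   # values 3 through 41
        label = int(row[41])                       # 42nd value (training data only)
        records[account_id].append((record_id, features, label))

# Account label for the classification task; first-'bad' time for the
# time series task.
account_label = {a: int(any(lab for _, _, lab in rs)) for a, rs in records.items()}
first_bad = {a: min((rid for rid, _, lab in rs if lab), default=None)
             for a, rs in records.items()}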
There are two separate competitions:
Project E1: Classification Task
You are given the account information for a number of customers and
must predict who are the 'bad' customers (i.e. customers that
have accounts with at least one 'bad' record label).
Project E2: Time Series Analysis Task
You are given account information for a number of customers and must
determine when the customer becomes 'bad' (i.e. when the
first 'bad' record occurs).
Note that these tasks are not independent of one another.
Datasets can be downloaded at http://mill.ucsd.edu/index.php?page=Datasets&subpage=AllData
F: Physiological Data Modeling (bodymedia)
Physiological data offers many challenges to the machine learning community, including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.
The data fields include two user characteristics and nine sensor channels:
characteristic1 | age
characteristic2 | handedness
sensor1 | gsr_low_average
sensor2 | heat_flux_high_average
sensor3 | near_body_temp_average
sensor4 | pedometer
sensor5 | skin_temp_average
sensor6 | longitudinal_accelerometer_SAD
sensor7 | longitudinal_accelerometer_average
sensor8 | transverse_accelerometer_SAD
sensor9 | transverse_accelerometer_average
Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/
Project idea:
Project F1: Behavior classification: classify the person based on the sensor measurements.
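A minimal sketch of this classification setup using the sensor fields listed above; the random values are placeholders for the real downloaded sessions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

fields = ["gsr_low_average", "heat_flux_high_average", "near_body_temp_average",
          "pedometer", "skin_temp_average", "longitudinal_accelerometer_SAD",
          "longitudinal_accelerometer_average", "transverse_accelerometer_SAD",
          "transverse_accelerometer_average"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(fields)))   # placeholder: one row per time window
y = rng.integers(0, 5, size=500)          # placeholder person ids

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("accuracy: %.3f" % cross_val_score(clf, X, y, cv=5).mean())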
G: NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:
-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records
Project ideas:
Project G1: Outlier detection on the players: find out who the outstanding players are.
Project G2: Predict the game outcome.
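A sketch of the outlier-detection idea on a player-by-statistics matrix; the columns and values are placeholders for the real regular-season table:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
stats = rng.normal(size=(400, 6))   # placeholder: e.g., pts, reb, ast, stl, blk, min
stats[:5] += 4.0                    # a few artificially extreme players

flags = IsolationForest(random_state=0).fit_predict(stats)  # -1 = outlier
print("players flagged as outliers:", np.where(flags == -1)[0])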