Text-Driven Forecasting (11-773)

Instructor: Prof. Noah Smith
History: Taught in Fall 2009 (Tuesday/Thursday 12-1:20 pm, Wean 5304)
Prerequisite: permission of instructor

Course Description

Text-driven forecasting is an emerging collection of problems in which text documents or document collections are automatically analyzed to make specific, testable predictions about the future. Well-known examples include predictions about stock or market behavior, product sales patterns, government elections, legislative activities, or public opinion polls.

While a research community focusing on these problems has yet to form, this course is based on the following observations:

Forecasting provides a new driving force for research in natural language processing. What level of "understanding" is needed for predictions to be accurate?
Forecasting is a unique machine learning problem involving discrete non-IID data, time series, and very natural evaluation against real-world events (i.e., did the model correctly predict what would happen today?).
The rise of social media (and non-news text more generally) and their availability on the web will inspire many new forecasting problems and datasets.
Focusing on tangible real-world predictions will provide a nexus for computer scientists to come together with domain experts to reason about language use and how it should be modeled.
Because people can never be expected to read all of the content relevant to a particular question about the future, intelligent text processing methods may be the only way such content can be fully exploited.

This twelve-credit seminar-project hybrid course aims to begin identifying challenge problems and testing some solutions to them.

Format

The time and location are TBD; please contact the instructor if you are interested in participating.

The course will meet twice a week for the first month or so, operating like a seminar with discussion of two or three papers per week and brainstorming. The remainder of the semester will focus on team projects, which will make up the bulk of the grade. Each team of approximately three students will build a system that uses a text database to make testable predictions about the future.

A student wishing to audit the course will be expected to attend the course meetings, serve as an informal consultant to one of the teams, and write a short "lessons learned" paper at the end of the semester.

This course counts as a "lab" for LTI students.

Grading

Grades will be assigned based on participation in class discussions (40%) and the course project (60%).

Course Plan and Readings

Part 1: Seminar (roughly 1/3 of the semester)

Date	Readings to discuss	Notes
Tu 8-25	None; introductions, administrivia, and high-level discussion about the course.
Th 8-27	Das and Chen, 2007: Yahoo! for Amazon: Sentiment extraction from small talk on the Web. This is a journal version of a much-cited 2001 paper you can find here.	Note that the classification techniques in this paper are very simplistic, from the point of view of machine learning as well as computational linguistics. Brendan's notes.
Tu 9-1	Koppel and Shtrimberg, 2004: Good news or bad news? Let the market decide. Lavrenko, Schmill, Lawrie, Ogilvie, Jensen, and Allen, 2000: Mining of concurrent text and time series.	Vasco's notes.
Th 9-3	Ghose, Ipeirotis, and Sundararajan, 2007: Opinion mining using econometrics: a case study on reputation systems. Kogan, Levin, Routledge, Sagi, and Smith, 2009: Predicting risk from financial reports with regression.	Brendan's notes.
Tu 9-8	Antweiler and Frank, 2005: Do US stock markets typically overreact to corporate news stories? Skim only: Antweiler and Frank, 2004: Is all that talk just noise? The information content of Internet message boards.	Mahesh's notes.
Th 9-10	Danescu-Niculescu-Mizil, Kossinets, Kleinberg, and Lee, 2009: How opinions are received by online communities: A case study on Amazon.com helpfulness votes.	Mahesh's notes.
Tu 9-15	Monroe, Colaresi, and Quinn, 2009: Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict.	Dipanjan's notes.
Th 9-17	Lerman, Gilder, Dredze, and Pereira, 2008: Reading the markets: Forecasting public opinion of political candidates by news analysis.	Ramnath's notes.
Tu 9-22	(no meeting)
Th 9-24	Gentzkow and Shapiro, 2007: What drives media slant? Evidence from U.S. daily newspapers.	Neel's notes.
Tu 9-29	Fader, Radev, Crespin, Monroe, Quinn, and Colaresi, 2007: MavenRank: Identifying influential members of the U.S. Senate using lexical centrality.	Dipanjan's notes.
Th 10-1	Tausczik and Pennebaker, 2009: The psychological meaning of words: LIWC and computerized text analysis methods.

Part 2: Projects (roughly 2/3 of the semester)

After deciding on project topics and forming teams, we will usually meet as a class once a week to discuss issues that come up in the projects and hear interim reports from each team. There may be some additional readings as well.

Date	Plan
Tu 10-6	Project proposals
Th 10-8	Project selection and division into teams
Tu 10-13	Zhang and Skiena, 2009: Improving movie gross prediction through news analysis.
Tu 10-20	Dodds and Danforth, 2009: Measuring the happiness of large-scale written expression: songs, blogs, and presidents.
Tu 10-27	Simonoff and Sparrow, 2000: Predicting movie grosses: Winners and losers, blockbusters and sleepers.
Tu 11-3	Friedman, Hastie, and Tibshirani, 2009: Regularization paths for generalized linear models via coordinate descent.
Tu 11-10	Mishne and Glance, 2006: Predicting movie sales from blogger sentiment.
Tu 11-17	(no paper)
Tu 11-24	Liang, Jordan, and Klein, 2009: Learning semantic correspondences with less supervision.
Th 12-3	Final project presentations (Thursday, not Tuesday!)

Useful Resources

Subjectivity bibliography from Jan Wiebe at Pitt.
