Text-Driven Forecasting (11-773)
Instructor: Prof. Noah Smith
History: Taught in Fall 2009 (Tuesday/Thursday 12-1:20 pm, Wean 5304)
Prerequisite: permission of instructor
Course Description
Text-driven forecasting is an emerging collection of problems in which
text documents or document collections are automatically analyzed to
make specific, testable predictions about the future. Well-known
examples include predictions about stock or market behavior, product sales
patterns, government elections, legislative activities, or public opinion polls.
While a research community focusing on these problems has yet to form,
this course is based on the following observations:
- Forecasting provides a new driving force for research
in natural language processing. What level of "understanding" is needed for predictions to be accurate?
- Forecasting is a unique machine
learning problem involving discrete non-IID data, time series, and
very natural evaluation against real-world events (i.e., did the model
correctly predict what would happen today?).
- The rise of social media
(and non-news text more generally) and their availability on the web
will inspire many new forecasting problems and datasets.
- Focusing on tangible real-world predictions will provide a nexus for computer scientists to come together with domain experts to reason about language use and how it should be modeled.
- Because people can never be expected to read all of the content relevant to a particular question about the future, intelligent text processing methods may be the only way such content can be fully exploited.
This twelve-credit seminar-project hybrid course aims to begin identifying
challenge problems and testing some solutions to them.
Format
The time and location are TBD; please contact the instructor if you are interested in participating.
The course will meet twice a week for the first month or so, operating like a seminar with discussion of two or three papers per week and brainstorming. The remainder of the semester will focus on team projects, which will account for the bulk of the grade. Each team of approximately three students will build a system that uses a text database to make testable predictions about the future.
A student wishing to audit the course will be expected to
attend the course meetings,
serve as an informal consultant to one of the teams, and write a short "lessons
learned" paper at the end of the semester.
This course counts as a "lab" for LTI students.
Grading
Grades will be assigned based on participation in class discussions (40%) and the course project (60%).
Course Plan and Readings
Part 1: Seminar (roughly 1/3 of the semester)
Date | Readings to discuss | Notes |
Tu 8-25 | None; introductions, administrivia, and high-level discussion about the course. |
Th 8-27 | Das and Chen, 2007: Yahoo! for Amazon: Sentiment extraction from small talk on the Web. This is a journal version of a much-cited 2001 paper you can find here. | Note that the classification techniques in this paper are very simplistic, from the point of view of machine learning as well as computational linguistics. Brendan's notes. |
Tu 9-1 | Koppel and Shtrimberg, 2004: Good news or bad news? Let the market decide. Lavrenko, Schmill, Lawrie, Ogilvie, Jensen, and Allen, 2000: Mining of concurrent text and time series. | Vasco's notes. |
Th 9-3 | Ghose, Ipeirotis, and Sundararajan, 2007: Opinion mining using econometrics: a case study on reputation systems. Kogan, Levin, Routledge, Sagi, and Smith, 2009: Predicting risk from financial reports with regression. | Brendan's notes. |
Tu 9-8 | Antweiler and Frank, 2005: Do US stock markets typically overreact to corporate news stories? Skim only: Antweiler and Frank, 2004: Is all that talk just noise? The information content of Internet message boards. | Mahesh's notes. |
Th 9-10 | Danescu-Niculescu-Mizil, Kossinets, Kleinberg, and Lee, 2009: How opinions are received by online communities: A case study on Amazon.com helpfulness votes. | Mahesh's notes. |
Tu 9-15 | Monroe, Colaresi, and Quinn, 2009: Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. | Dipanjan's notes. |
Th 9-17 | Lerman, Gilder, Dredze, and Pereira, 2008: Reading the markets: Forecasting public opinion of political candidates by news analysis. | Ramnath's notes. |
Tu 9-22 | (no meeting) |
Th 9-24 | Gentzkow and Shapiro, 2007: What drives media slant? Evidence from U.S. daily newspapers. | Neel's notes. |
Tu 9-29 | Fader, Radev, Crespin, Monroe, Quinn, and Colaresi, 2007: MavenRank: Identifying influential members of the U.S. Senate using lexical centrality. | Dipanjan's notes. |
Th 10-1 | Tausczik and Pennebaker, 2009: The psychological meaning of words: LIWC and computerized text analysis methods. |
Part 2: Projects (roughly 2/3 of the semester)
After deciding on project topics and forming teams, we will usually meet as a class once a week to discuss issues that come up in the projects and hear interim reports from each team. There may be some additional readings as well.
Date | Plan |
Tu 10-6 | Project proposals |
Th 10-8 | Project selection and division into teams |
Tu 10-13 | Zhang and Skiena, 2009: Improving movie gross prediction through news analysis. |
Tu 10-20 | Dodds and Danforth, 2009: Measuring the happiness of large-scale written expression: songs, blogs, and presidents. |
Tu 10-27 | Simonoff and Sparrow, 2000: Predicting movie grosses: Winners and losers, blockbusters and sleepers. |
Tu 11-3 | Friedman, Hastie, and Tibshirani, 2009: Regularization paths for generalized linear models via coordinate descent. |
Tu 11-10 | Mishne and Glance, 2006: Predicting movie sales from blogger sentiment. |
Tu 11-17 | (no paper) |
Tu 11-24 | Liang, Jordan, and Klein, 2009: Learning semantic correspondences with less supervision. |
Th 12-3 | Final project presentations (Thursday, not Tuesday!) |
Useful Resources