Andrew McCallum's Home Page

McCallum in Grand Teton National Park, 1991

Associate Professor, Department of Computer Science, University of Massachusetts Amherst, (2003-present).
Research Associate Professor, Department of Computer Science, University of Massachusetts Amherst, (2002-2003).
VP, Research and Development, WhizBang! Labs; Director, WhizBang! Labs East, (2000-2002).
Adjunct Faculty, Carnegie Mellon University, Center for Automated Learning and Discovery & Language Technologies Institute, (1998-present).
Research Scientist, Research Coordinator, Just Research (aka JPRC), (1997-2000).
Postdoc, Carnegie Mellon, Computer Science with Sebastian Thrun and Tom Mitchell, 1996.
Ph.D., University of Rochester, Computer Science, with Dana Ballard, 1995.
B.A., Dartmouth College, 1989.
H.S., NCSSM, 1985.
A bio from the preface to my Ph.D. thesis (1995).
A more recent bio.
My CV.
Baby Pictures of my sister and me.
Pictures of my son Theo, born April 10, 1999.

I have moved to an Associate Professorship at UMass Amherst, and this page is out of date.
My current home page is http://www.cs.umass.edu/~mccallum.

Email:	mccallum@cs.umass.edu, mccallum@cs.cmu.edu,
Work:	(413) 545-1323, FAX: (413) 545-1789. Dept. of Computer Science, 140 Governors Drive, UMass, Amherst, MA 01003. Map,
Home:	(413) 549-7808, 128 Cottage Street, Amherst, MA 01002. Map.

Research Interests

Machine Learning applied to Text, Information Retrieval and Extraction. Since 1996 I have been working on statistical approaches to text classification, clustering and extraction. I was a member of the CMU Text Learning Group and Tom Mitchell's World-Wide Knowledge Base Project.
I was the leader of the project at JustResearch that created Cora, a domain-specific search engine over computer science research papers. It currently contains over 50,000 PostScript papers. You can read more about our research on Cora in our Information Retrieval Journal paper or in a paper presented at the AAAI'99 Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan Chang and Jason Reed.
I am the author of rainbow (and its library, libbow), an LGPL'ed software package for statistical text classification written in C.
I have been invited to give a tutorial at the Neural Information Processing Systems conference (NIPS*2002). The title is "Information Extraction from the World Wide Web".
With Lillian Lee, Tony Jebara and Kamal Nigam, I co-organized an IJCAI'2001 workshop titled Text Learning: Beyond Supervision.
With Thorsten Joachims, Mehran Sahami and Lyle Ungar, I co-organized an IJCAI-99 workshop on Machine Learning for Information Filtering.
With Rich Caruana, Virginia de Sa and Michael Kearns, I co-organized a NIPS*98 workshop on "Integrating Supervised and Unsupervised Learning".
With Mehran Sahami, Mark Craven and Thorsten Joachims, I co-organized an ICML/AAAI-98 workshop on Learning for Text Categorization.
Reinforcement Learning, especially with hidden state and factored representations. My thesis uses memory-based learning and a robust statistical test on reward in order to learn a structured policy representation that makes perceptual and memory distinctions only where needed for the task at hand. It can also be understood as a method of Value Function Approximation. The model learned is an order-n partially observable Markov decision process. It handles noisy observations, actions and rewards.
It is related to Ron, Singer and Tishby's Probabilistic Suffix Trees, Leslie Kaelbling's G-algorithm and Andrew Moore's Parti-game. It is distinguished from similar-era work by Michael Littman, Craig Boutilier and others in that it learns both a model and a policy, and is quite practical with infinite-horizon tasks and large state and observation spaces. Follow-on or comparison work has been done by Anders Jonsson, Andy Barto, Will Uther, Leslie Pack Kaelbling, Natalia Hernandez, and Sridhar Mahadevan.
The algorithm, called U-Tree, was demonstrated solving a highway driving task using simulated eye-movements and deictic representations. The simulated environment has about 21,000 states, 2,500 observations, noise and much hidden state. After about 2 1/2 hours of simulated experience, U-Tree learns a task-specific model of the environment that has only 143 states.
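As a rough illustration of the idea of testing reward statistics before adding a distinction, here is a minimal sketch. It is not the U-Tree implementation: it assumes a two-sample Kolmogorov-Smirnov statistic as the test and a fixed cutoff, and the function names, the `threshold` value and the dictionary layout are all hypothetical. A real test would use a critical value that depends on the sample sizes.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def should_split(returns_by_branch, threshold=0.5):
    """Keep a candidate distinction only if at least one pair of
    branches has clearly different return distributions.

    returns_by_branch maps each value of the candidate distinction
    (hypothetical layout) to the list of returns observed under it.
    threshold is an illustrative fixed cutoff, not a p-value.
    """
    branches = [r for r in returns_by_branch.values() if r]
    for i in range(len(branches)):
        for j in range(i + 1, len(branches)):
            if ks_statistic(branches[i], branches[j]) > threshold:
                return True
    return False
```

With this sketch, a distinction whose branches see very different returns would be kept, while one whose branches see the same returns would be dropped, which is the sense in which distinctions are made only where the task needs them.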

Publications

A list of my reinforcement learning publications can be found at my Rochester home page.
Publications of the WebKB group can be found through its homepage.
Some newer ones:
- Learning with Scope, with Application to Information Extraction and Classification. David Blei, Drew Bagnell and Andrew McCallum. UAI-2002.
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum and Fernando Pereira. ICML-2001.
- Toward Optimal Active Learning through Sampling Estimation of Error Reduction. Nick Roy and Andrew McCallum. ICML-2001.
- Unlocking the Information in Text. Dallan Quass, Andrew McCallum, William Cohen. The Future of Software, Winter 2000/2001.
- Learning to Understand the Web. William Cohen, Andrew McCallum, Dallan Quass. IEEE Data Engineering Bulletin. September 2000, Vol. 23, No. 3. Pages 17-24.
- Automating the Construction of Internet Portals with Machine Learning. Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore. Information Retrieval Journal, volume 3, pages 127-163. Kluwer. 2000.
- Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag and Fernando Pereira. ICML-2000.
- Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Andrew McCallum, Kamal Nigam and Lyle Ungar. KDD-2000.
- Information Extraction with HMM Structures Learned by Stochastic Optimization. Dayne Freitag and Andrew McCallum. AAAI-2000.
- Creating Customized Authority Lists. Huan Chang, David Cohn and Andrew McCallum. ICML-2000.
- Semi-supervised Clustering with User Feedback. David Cohn, Rich Caruana and Andrew McCallum. Unpublished manuscript. (Submitted to AAAI 2000)
- Multi-Label Text Classification with a Mixture Model Trained by EM. Andrew McCallum. Revised version of paper appearing in AAAI'99 Workshop on Text Learning.
- A Hierarchical Probabilistic Model for Novelty Detection in Text. Doug Baker, Thomas Hofmann, Andrew McCallum and Yiming Yang. Unpublished manuscript. (Submitted to NIPS'99.)
- Using Maximum Entropy for Text Classification. Kamal Nigam, John Lafferty, Andrew McCallum. IJCAI'99 Workshop on Information Filtering.
- Information Extraction with HMMs and Shrinkage. Dayne Freitag and Andrew McCallum. AAAI'99 Workshop on Machine Learning for Information Extraction.
- Learning Hidden Markov Model Structure for Information Extraction. Kristie Seymore, Andrew McCallum, Roni Rosenfeld. AAAI'99 Workshop on Machine Learning for Information Extraction.
- Building Domain-Specific Search Engines with Machine Learning Techniques. Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore. AAAI-99 Spring Symposium. A related paper was also accepted to IJCAI'99.
- Using Reinforcement Learning to Spider the Web Efficiently. Jason Rennie and Andrew McCallum. ICML'99.
- A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. AAAI-98 Workshop on "Learning for Text Categorization".
- Improving Text Classification by Shrinkage in a Hierarchy of Classes. Andrew McCallum, Ronald Rosenfeld, Tom Mitchell and Andrew Ng. ICML-98.
- Employing EM in Pool-Based Active Learning for Text Classification. Andrew McCallum and Kamal Nigam. ICML-98.
- Distributional Clustering of Words for Text Classification. Doug Baker, Andrew McCallum. SIGIR-98.
- Text Classification from Labeled and Unlabeled Documents using EM. Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3). pp. 103-134. 2000.
- Learning to Classify Text from Labeled and Unlabeled Documents. Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. AAAI-98.
- Learning to Extract Knowledge from the World Wide Web. Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery. AAAI-98.

Research Resources

Other Interests

Cooking. Especially bread baking.
Hiking. Especially in the White Mountains of New Hampshire.
Contra Dancing. It's traditional New England Folk dancing, and it is not the same as Square Dancing. See also Yahoo's Contra Dancing links.
The game of Go.
Juggling. Passing clubs.
Photography.
Hacking. I guess I can't get enough of it at work, because I even do it for fun. Crazy me.
- Along with Adam Fedor, I was the chief maintainer of GNUstep, the Free Software Foundation's effort to implement NeXT's OpenStep standard.
- I also hacked on libguileobjc, an interface between GNU Guile (a Scheme interpreter) and Objective-C.
- I wrote persia, a toolkit for building virtual reality environments on Rochester's SGI Onyx RealityEngine2. The kit is based on SGI's Performer library and ELK Scheme.
- I wrote rlkit, a software library that makes it easy to test various reinforcement learning algorithms in different environments with different sensory-motor systems. It's implemented in Objective-C and Guile.