CorpusBuilder
CorpusBuilder is a system for automatically constructing corpora for a minority language from the web. It is written in Perl and makes use of van Noord's TextCat system. To run CorpusBuilder, you need Perl 5.0 or greater and lynx on a Unix system.
Email us if you'd like to use it.
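TextCat is a language identification tool based on character n-gram frequency profiles (the rank-order method of Cavnar and Trenkle). The Python sketch below illustrates that general idea only. It is not CorpusBuilder's or TextCat's code: the function names, the toy training sentences, and the profile size are all assumptions made for illustration.

```python
from collections import Counter

# Toy training text (assumption: real profiles are built from far more data).
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the hat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat met de hoed",
}


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the character n-grams (lengths 1..max_n) of text by frequency."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: position for position, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(position - rank[gram]) if gram in rank else penalty
        for position, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


profiles = {lang: ngram_profile(sample) for lang, sample in TOY_SAMPLES.items()}
print(guess_language("de hond en de kat", profiles))
```

A corpus builder could apply a check of this kind to each page it retrieves and keep only the pages whose closest profile is the target language. With training text this small the guesses are unreliable, so the sketch only shows the shape of the computation.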
Members
Publications
- Rayid Ghani, Rosie Jones and Dunja Mladenic:
"Building Minority Language Corpora by Learning to Generate Web Search Queries",
KAIS Knowledge and Information Systems, 2003
[gzipped Postscript] [PDF]
- Rayid Ghani, Rosie Jones and Dunja Mladenic:
"Using the Web to Create Minority Language Corpora",
10th International Conference on Information and Knowledge Management
(CIKM-2001)
[gzipped Postscript]
- Rayid Ghani, Rosie Jones and Dunja Mladenic:
"On-line learning for Web query generation: finding documents matching a minority concept on the Web",
Proceedings of the First Asia-Pacific Conference on Web Intelligence
(WI-2001)
[gzipped Postscript]
- Rayid Ghani, Rosie Jones and Dunja Mladenic:
"Automatic Web Search Query Generation to Create Minority Language Corpora",
Poster paper in proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001)
[gzipped Postscript]
- Rayid Ghani, Rosie Jones and Dunja Mladenic:
"Building Minority Language Corpora by Learning to Generate Web Search Queries",
Carnegie Mellon University
Center for Automated Learning and Discovery
Technical Report CMU-CALD-01-100 (2001)
[gzipped Postscript]
[PDF]
- Rosie Jones and Rayid Ghani:
"Automatically Building a Corpus of a Minority Language from the Web",
ACL 2000
Student Research Workshop
[gzipped Postscript]
[PDF] (version updated September 2000)
- Rayid Ghani and Rosie Jones:
"Learning a Monolingual Language Model from a Multilingual Text Database",
Ninth International Conference on Information and Knowledge Management (CIKM-2000)
[gzipped Postscript]
[PDF]
Corpora
Rosie Jones
Last modified: Fri Mar 26 20:24:45 EST 2004