TurboParser (Dependency Parser with Linear Programming)
Background
Dependency parsing is a lightweight syntactic formalism that relies on lexical relationships between words.
Non-projective dependency grammars may generate languages that are not context-free, offering a formalism
that is arguably more adequate for some natural languages.
Statistical parsers, learned from treebanks, have achieved the best performance on this task. While exact
inference is tractable only for local (arc-factored) models, it has been shown that including non-local features
and performing approximate inference can greatly increase performance.
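The arc-factored decomposition mentioned above can be made concrete with a small sketch (illustrative Python, not part of TurboParser): a tree is a set of head-to-modifier arcs, and its score is the sum of independent per-arc scores, which is what makes exact inference tractable for local models.

```python
# Illustrative sketch of an arc-factored model (names are ours, not TurboParser's API).
# A parse is encoded as a head array: heads[m] is the head of token m,
# with 0 denoting the artificial root and tokens indexed from 1.

def tree_score(heads, arc_score):
    """Score of a tree = sum of independent per-arc scores."""
    return sum(arc_score(h, m) for m, h in enumerate(heads[1:], start=1))

# Toy arc scores for "I solved the problem": 1->2, 2->root, 3->4, 4->2.
scores = {(2, 1): 1.0, (0, 2): 2.0, (4, 3): 0.5, (2, 4): 1.5}
heads = [0, 2, 0, 4, 2]  # index 0 is a placeholder for the root

print(tree_score(heads, lambda h, m: scores.get((h, m), 0.0)))  # 5.0
```

Higher-order models add parts that couple several arcs (siblings, grandparents), which breaks this independence and motivates the approximate inference used here.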
This package contains a C++ implementation of a
dependency parser based on the papers [1,2,3,4,5] below.
The latest version of this package also contains C++ implementations of
a POS tagger, a semantic role labeler, an entity tagger,
a coreference resolver, and a constituent (phrase-based) parser.
The relevant references are the papers [6,7,8,9] below.
This package allows:
- learning a parser/tagger/semantic parser/entity tagger/coreference resolver from a treebank,
- running a parser/tagger/semantic parser/entity tagger/coreference resolver on new data,
- evaluating the results against a gold-standard.
News
We released TurboParser v2.3 on November 6th, 2015!
This version introduces some new features:
- A named entity recognizer (TurboEntityRecognizer) based on the Illinois Entity Tagger (ref. [7] below).
- A coreference resolver (TurboCoreferenceResolver) based on the Berkeley Coreference Resolution System (ref. [8] below).
- A constituent parser based on a dependency-to-constituent reduction, implementing ref. [9] below.
- A dependency labeler, TurboDependencyLabeler, that can optionally be applied after the dependency parser.
- Compatibility with MS Windows (using MSVC) and with C++0x.
We released TurboParser v2.2 on June 26th, 2014!
This version introduces some new features:
- A Python wrapper for the tagger and parser (requires Cython 0.19).
- A semantic role labeler (TurboSemanticParser) implementing ref. [6] below.
We released TurboParser v2.1 on May 23rd, 2013!
This version introduces some new features:
- The full model now has third-order parts for grand-siblings and tri-siblings (see ref. [5] below).
- Compatibility with MS Windows (using MSVC).
We released TurboParser v2.0 on September 20th, 2012!
This version introduces a number of new features:
- The parser no longer depends on CPLEX (or any other non-free LP solver). Instead, the decoder is now based on AD3, our free library for approximate MAP inference.
- The parser now outputs dependency labels along with the backbone structure.
- As a bonus, we now provide a trainable part-of-speech tagger, called TurboTagger, which can be used in standalone mode or to provide part-of-speech tags as input for the parser. TurboTagger has state-of-the-art accuracy for English (97.3% on section 23 of the Penn Treebank) and is fast (~40,000 tokens per second).
- The parser is much faster than in previous versions. You may choose among a basic arc-factored parser (~4,300 tokens per second), a standard second-order model with consecutive-sibling and grandparent features (the default; ~1,200 tokens per second), and a full model with head-bigram and arbitrary-sibling features (~900 tokens per second).
Note: The runtimes above are approximate, and based on experiments with a desktop machine with an Intel Core i7 3.4 GHz CPU and 8GB RAM.
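To make the model hierarchy concrete, here is a rough illustration (our own sketch, not TurboParser code) of the extra parts the second-order model scores on top of individual arcs: grandparent parts (head of the head) and consecutive-sibling parts (adjacent modifiers of the same head, on the same side).

```python
# Illustrative enumeration of second-order parts of a dependency tree.
# heads[m] is the head of token m; index 0 is the artificial root.

def second_order_parts(heads):
    n = len(heads) - 1
    grandparents = []  # (grandparent, head, modifier)
    siblings = []      # (head, modifier, next consecutive modifier, same side)
    for m in range(1, n + 1):
        h = heads[m]
        if h != 0:  # the root has no grandparent
            grandparents.append((heads[h], h, m))
    for h in range(0, n + 1):
        mods = [m for m in range(1, n + 1) if heads[m] == h]
        left = sorted((m for m in mods if m < h), reverse=True)  # outward from head
        right = sorted(m for m in mods if m > h)
        for side in (left, right):
            for a, b in zip(side, side[1:]):
                siblings.append((h, a, b))
    return grandparents, siblings

# Tree for "I solved the problem quickly": heads 1->2, 2->root, 3->4, 4->2, 5->2.
gp, sib = second_order_parts([0, 2, 0, 4, 2, 2])
print(gp)
print(sib)  # [(2, 4, 5)]: tokens 4 and 5 are consecutive right modifiers of 2
```

The full model adds third-order parts (grand-siblings, tri-siblings) on top of these, which is why each step up the hierarchy trades speed for accuracy.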
To run this software, you need a standard C++ compiler.
This software has the following external dependencies: AD3, a library for
approximate MAP inference; Eigen, a template
library for linear algebra; google-glog, a library for logging;
and gflags, a library
for command-line flag processing. All these libraries are free software and are
provided as tarballs in this package.
This software has been tested on several Linux platforms. It has also been
compiled successfully on Mac OS X and MS Windows (using MSVC).
Further Reading
The main technical ideas behind this software appear in the papers:
[1] André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Concise Integer Linear Programming Formulations for Dependency Parsing.
Annual Meeting of the Association for Computational Linguistics (ACL'09), Singapore, August 2009.
[2] André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Polyhedral Outer Approximations with Application to Natural Language Parsing.
International Conference on Machine Learning (ICML'09), Montreal, Canada, June 2009.
[3] André F. T. Martins, Noah A. Smith, Eric P. Xing, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.
TurboParsers: Dependency Parsing by Approximate Variational Inference.
Empirical Methods in Natural Language Processing (EMNLP'10), Boston, USA, October 2010.
[4] André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.
Dual Decomposition With Many Overlapping Components.
Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.
[5] André F. T. Martins, Miguel B. Almeida, and Noah A. Smith.
Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers.
Annual Meeting of the Association for Computational Linguistics (ACL'13), Sofia, Bulgaria, August 2013.
[6] André F. T. Martins and Mariana S. C. Almeida.
Priberam: A Turbo Semantic Parser with Second Order Features.
International Workshop on Semantic Evaluation (SemEval), task 8: Broad-Coverage Semantic Dependency Parsing, Dublin, August 2014.
[7] Lev Ratinov and Dan Roth.
Design Challenges and Misconceptions in Named Entity Recognition.
Conference on Computational Natural Language Learning (CoNLL'09), 2009.
[8] Greg Durrett and Dan Klein.
Easy Victories and Uphill Battles in Coreference Resolution.
Empirical Methods in Natural Language Processing (EMNLP'13), 2013.
[9] Daniel Fernández-González and André F. T. Martins.
Parsing as Reduction.
Annual Meeting of the Association for Computational Linguistics (ACL'15), Beijing, China, August 2015.
Download
The latest version of TurboParser is TurboParser v2.3.0 [~5.4MB, .tar.gz format].
See the README file for instructions for compilation, running, and file formatting.
It does not include the data sets used in the papers;
for information about how to get these data sets, please go to http://nextens.uvt.nl/~conll.
Bear in mind that some data sets must be separately licensed through the LDC.
In addition, we provide separately the following pre-trained models (notice that these are very large files):
- An English tagger trained on sections 02-21 of the Penn Treebank.
Click here to download this model [~3.3MB, .tar.gz format].
Then uncompress the model and save it in a local folder (e.g. as models/english_proj_tagger.model).
To tag a new file <input-file>, type:
./TurboTagger --test \
--file_model=models/english_proj_tagger.model \
--file_test=<input-file> \
--file_prediction=<output-file> \
--logtostderr
Check the README for file formatting instructions and additional options.
- First-, second-, and third-order English parsers trained on sections 02-21 of the Penn Treebank,
with dependencies extracted using the head rules of Yamada and Matsumoto, through Penn2Malt.
Click here to download these models [~1.8GB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/english_proj_parser_model-{basic,standard,full}.model).
To parse a new file <input-file> in CoNLL format, type:
./TurboParser --test \
--file_model=models/english_proj_parser_model-standard.model \
--file_test=<input-file> \
--file_prediction=<output-file> \
--logtostderr
Check the README for file formatting instructions and additional options.
- First-, second-, and third-order Arabic parsers trained on the Arabic dataset from the CoNLL-X shared task.
Click here to download these models [~520 MB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/arabic_parser_model-{basic,standard,full}.model).
To parse a new file <input-file> in CoNLL format, type:
./TurboParser --test \
--file_model=models/arabic_parser_model-standard.model \
--file_test=<input-file> \
--file_prediction=<output-file> \
--logtostderr
Check the README for file formatting instructions and additional options.
- Taggers and parsers for Kinyarwanda and Malagasy.
There is a README specifically for these models. They require TurboParser v2.0.2.
- A Farsi parser trained on the Dadegan Persian treebank. Click here to download the model [~530 MB, .tar.gz format]. This model requires TurboParser v2.0.2.
Associated Farsi NLP tools can be found here.
- Parsers that generate Stanford-style dependencies can be found here.
- A parser trained on the English Web Treebank for Stanford basic dependencies can be found here.
Finally, this package provides a script, parse.sh, that allows you to tag and parse
free text (in English, one sentence per line) with the models above. Just type:
./scripts/parse.sh <filename>
where <filename> is a text file with one sentence per line. If no filename is
specified, the script reads from stdin, so e.g.
echo "I solved the problem with statistics." | ./scripts/parse.sh
yields
1 I _ PRP PRP _ 2 SUB
2 solved _ VBD VBD _ 0 ROOT
3 the _ DT DT _ 4 NMOD
4 problem _ NN NN _ 2 OBJ
5 with _ IN IN _ 2 VMOD
6 statistics _ NNS NNS _ 5 PMOD
7 . _ . . _ 2 P
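The output above is in the CoNLL-style column format the parser reads and writes (see the README for the authoritative description). As a minimal sketch of consuming it downstream, here is a small Python reader; the column positions are inferred from the example above (id, form, lemma, coarse POS, POS, feats, head, dependency label) and tab separation is assumed.

```python
# Minimal sketch of reading CoNLL-style parser output into
# (form, tag, head, label) tuples. Column layout assumed from the
# example output; consult the README for the exact format.

def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # id, form, lemma, cpos, pos, feats, head, deprel
        current.append((cols[1], cols[4], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences

example = "1\tI\t_\tPRP\tPRP\t_\t2\tSUB\n2\tsolved\t_\tVBD\tVBD\t_\t0\tROOT"
print(read_conll(example.split("\n")))
```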
Older versions:
- TurboParser v2.2.0 [~2.8MB, .tar.gz format].
- TurboParser v2.1.0 [~2.5MB, .tar.gz format].
- TurboParser v2.0.2 [~2.5MB, .tar.gz format].
- TurboParser v2.0.1 [~2.5MB, .tar.gz format].
- TurboParser v2.0 [~3.2MB, .tar.gz format].
- TurboParser v0.1 [~2.5MB, .tar.gz format].
Along with this distribution, we released:
- an English parser trained on sections 02-21 of the Penn Treebank,
with dependencies extracted using the head rules of Yamada and Matsumoto [~1.2 GB, .tar.gz format];
- another English parser trained on the dataset provided in the CoNLL 2008 shared task [~1.4 GB, .tar.gz format];
- an Arabic parser trained on the CoNLL-X dataset [~225 MB, .tar.gz format];
- a script to apply these models to parse new data.
Contributing to TurboParser
For questions, bug fixes and comments, please e-mail afm [at] cs.cmu.edu.
To contribute to TurboParser, you can fork the GitHub repository: http://github.com/andre-martins/TurboParser.
To receive announcements about updates to TurboParser, join the ARK-tools mailing list.
Acknowledgments
A. M. was supported by an FCT/ICTI grant through
the CMU-Portugal Program, and by Priberam. This
work was partially supported by the FET programme
(EU FP7), under the SIMBAD project (contract 213250),
by National Science Foundation grant IIS-1054319,
and by the QNRF grant NPRP 08-485-1-083.