Software and Data

Software

This software was developed by me, my research group, and collaborators around the world.

cnn - a flexible, relatively fast neural network toolkit optimized for text modeling.
cdec - a fast, mature decoder, aligner, and modeling toolkit for statistical machine translation and similar structured prediction problems.
cpyp - a small, lightweight C++ Pitman–Yor process library.
HyperGREX - a simple, elegant, but feature-filled rule extractor for a variety of syntactic translation formalisms (for use with cdec).
The CMU cross-lingual metaphor detector - a toolkit for identifying instances of figurative language in English and any other language for which a bilingual dictionary is available.
fast_align - a very fast (but still pretty effective) unsupervised bilingual word aligner.
creg - a small and fast toolkit for large-scale linear, logistic, and ordinal regression modeling.

Data

English adjective supersenses - a 13-class supersense taxonomy of English adjectives developed by Yulia Tsvetkov, with an accompanying supersense classifier.
Korean-English Wikipedia Titles - a parallel corpus of Wikipedia titles from January 2012.
Chinese-English place names - a parallel corpus of Chinese place names from Wikipedia.