This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau.
git clone https://github.com/vchahun/teny.git
pip install -r teny/requirements.txt # install lxml
bzcat train.xml.bz2 | python teny/totext.py mlg > train.mlg.txt # extract the Malagasy text
bzcat train.xml.bz2 | python teny/totext.py eng > train.eng.txt # extract the English text
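Since the corpus is aligned at the sentence level, the two extracted files can be read side by side. The following is a minimal sketch, not part of teny: it assumes that totext.py writes one sentence per line, that line i of train.mlg.txt corresponds to line i of train.eng.txt, and that the files are UTF-8 encoded. None of this is stated on this page, and `pair_lines` is a hypothetical helper name.

```python
from itertools import zip_longest


def pair_lines(mlg_lines, eng_lines):
    """Yield (malagasy, english) pairs from two line-aligned iterables.

    Assumption: one sentence per line and the same number of lines on
    both sides; a length mismatch is reported instead of silently ignored.
    """
    for n, (m, e) in enumerate(zip_longest(mlg_lines, eng_lines), start=1):
        if m is None or e is None:
            raise ValueError(f"inputs differ in length at line {n}")
        yield m.rstrip("\n"), e.rstrip("\n")


# Usage with the files produced by the commands above (encoding assumed):
# with open("train.mlg.txt", encoding="utf-8") as f_mlg, \
#         open("train.eng.txt", encoding="utf-8") as f_eng:
#     for mlg, eng in pair_lines(f_mlg, f_eng):
#         ...
```

Raising on a length mismatch is a deliberate choice here: a plain `zip` would stop at the shorter file and hide an extraction problem.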
bzcat corpus.xml.bz2 | python teny/split.py 01 03 | bzip2 > test.xml.bz2 # select days 1-3
bzcat corpus.xml.bz2 | python teny/split.py 04 04 | bzip2 > dev.xml.bz2 # select day 4
bzcat corpus.xml.bz2 | python teny/split.py 05 31 | bzip2 > train.xml.bz2 # select days 5-31
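The day ranges in the split commands above can be read as a simple rule on the publication date. The sketch below illustrates that rule only; it assumes that "days" means the day of the month on which an article was published, which this page does not confirm, and `split_for` is a hypothetical helper rather than the logic of split.py.

```python
from datetime import date

# Inclusive day-of-month ranges, copied from the split commands above.
SPLITS = {"test": (1, 3), "dev": (4, 4), "train": (5, 31)}


def split_for(published):
    """Return the split name for a publication date (illustrative only)."""
    for name, (first, last) in SPLITS.items():
        if first <= published.day <= last:
            return name
    raise ValueError(f"no split covers day {published.day}")
```

Under this reading, every article lands in exactly one split, so test, dev and train share no documents.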
bzcat test.xml.bz2 | python teny/sample.py 11 > mt-test.xml # select ~1/11 of the documents
bzcat dev.xml.bz2 | python teny/sample.py 2 > mt-dev.xml # select ~1/2 of the documents
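How sample.py chooses documents is not described on this page. One common way to keep roughly one document in N is to keep each document independently with probability 1/N, which is sketched below. This is an assumption for illustration, not the actual implementation; `sample_documents` and its `seed` parameter are hypothetical.

```python
import random


def sample_documents(documents, n, seed=0):
    """Keep each document independently with probability 1/n.

    A fixed seed makes the selection reproducible across runs.
    """
    rng = random.Random(seed)
    return [doc for doc in documents if rng.random() < 1.0 / n]
```

With this approach the sample size is only approximately len(documents) / n, which matches the "~" in the comments above.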
More data will be released periodically (about 100 articles are published on Global Voices every month).
Sentence aligner used: Gargantua (F. Braune & A. Fraser, "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010).
The original content was published under a Creative Commons Attribution-Only license.
This work was supported by the Army Research Office (grant number W911NF-10-1-0533).