μtopia - Microblog Translated Posts Parallel Corpus
Release V1.1 - 19/09/2013

μtopia is a parallel corpus containing parallel sentence pairs mined from Microblogs. For more information, refer to our work:
Some quick references:
Also, if you use the resources on this website, we would appreciate it if you support our work by citing our paper. Bibtex available here.

Parallel Data Extraction

We are interested in users that parallel post (i.e., post tweets with translations in multiple languages). Here is an example of Snoop Dogg and Psy posting in parallel on Sina Weibo:

The parallel extraction system is composed of two main components.

Data Crawling

Before extracting a parallel corpus, you must have a crawl of tweets. You can either use pre-crawled tweets or build your own crawler. If you wish to build your own crawler, here are some resources you can refer to:
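Whichever resources you start from, the sketch below illustrates one simple crawling strategy: paging backwards through a user's timeline, which is useful for users known to parallel post. This is a minimal example, not our actual crawler; it assumes a Twitter API v1.1 application bearer token (the `BEARER_TOKEN` placeholder) and the `statuses/user_timeline` endpoint.

```python
# Minimal timeline crawler sketch (not our actual crawler).
# Assumes a Twitter API v1.1 application bearer token.
import requests

API = "https://api.twitter.com/1.1/statuses/user_timeline.json"
BEARER_TOKEN = "..."  # hypothetical: supply your own credential

def crawl_timeline(screen_name, pages=5):
    """Yield tweets from a user's timeline, paging backwards with max_id."""
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    max_id = None
    for _ in range(pages):
        params = {"screen_name": screen_name, "count": 200,
                  "tweet_mode": "extended"}
        if max_id is not None:
            params["max_id"] = max_id
        tweets = requests.get(API, headers=headers, params=params).json()
        if not isinstance(tweets, list) or not tweets:
            break  # error response or end of timeline
        for tweet in tweets:
            yield tweet
        max_id = tweets[-1]["id"] - 1  # continue strictly below the oldest id

for tweet in crawl_timeline("snoopdogg"):
    print(tweet["full_text"])
```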
Alignment and Extraction

Our extraction process is divided into two tasks: detecting whether a tweet contains parallel data (and, if so, its language pair), and aligning the parallel segments within the tweet.
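As a rough illustration of the alignment task, the sketch below brute-forces a single split point in a tweet and scores the two sides with a toy translation lexicon, keeping the best-scoring split. This is a simplification for illustration only; the lexicon and scoring function here are hypothetical, and our actual model is described in the paper.

```python
# Toy segment-alignment sketch: pick the split point that maximizes a
# lexical translation score between the left (English) and right (Mandarin)
# sides of a tweet. Illustrative only; see the paper for the real model.

# Hypothetical miniature translation lexicon of (English word, Mandarin
# character) pairs.
LEXICON = {("I", "我"), ("love", "爱"), ("my", "我"), ("dog", "狗")}

def score(en_tokens, zh_chars):
    """Count lexicon hits between candidate English and Mandarin segments."""
    return sum(1 for e in en_tokens for z in zh_chars if (e, z) in LEXICON)

def align(tweet):
    """Try every split point; return the best (english, mandarin) pair."""
    best = (0, "", "")
    for i in range(1, len(tweet)):
        left, right = tweet[:i], tweet[i:]
        s = score(left.split(), list(right))
        if s > best[0]:
            best = (s, left.strip(), right.strip())
    return best[1:]

print(align("I love my dog 我爱我的狗"))
# -> ('I love my dog', '我爱我的狗')
```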
If you are interested in working on the extraction of parallel data from Microblogs, you can download our gold standard, available here. It contains 4347 manually annotated tweets. Each tweet is annotated with the following information: (1) is there parallel data? (2) if so, what is the language pair of the data? (3) what are the spans of the parallel data? For research purposes you do not necessarily need to build a crawler: your goal would be to find ways to improve parallel data detection on this gold standard, as well as parallel segment alignment. Metrics to evaluate these tasks are described in our paper.

Machine Translation in Social Media

While Machine Translation on Microblogs and Social Media is a hot topic, the work published in this area so far is still very limited. We believe one big factor is the lack of parallel data to train, tune and test MT systems. Yet this is an interesting problem, since text in this domain is radically different from commonly translated domains, such as news or parliament data. Here are some examples of parallel sentences that illustrate frequent phenomena in this domain:
Social Media Machine Translation Toolkit

To promote research in this direction, we present our toolkit SMMTT (Social Media Machine Translation Toolkit). You can check it out from https://github.com/wlin12/SMMTT or download it here. If you are pursuing research on Machine Translation in Social Media or Microblog data, we recommend using this toolkit as a starting point, as it provides a baseline for your experiments.

Data - The toolkit provides 8000 training, 1250 development and 1250 test sentence pairs for the Mandarin-English language pair. These were carefully extracted from the full 3M sentence pairs extracted from Sina Weibo using heuristics (filtering duplicates, removing frequent alignment errors and finding users that frequently post parallel messages). You can find this corpus in the "./data" directory.

Modeling - The toolkit also provides scripts to automatically build a translation system using the training and development sets, and to evaluate the results on the test set, by running the "./scripts/runExperiment.sh" script.

Baseline - Results were computed and presented at the MT Marathon 2013. You can use these as a baseline for your experiments.
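If you want to inspect the provided data programmatically, a minimal sketch follows. It assumes the corpus is stored as two line-aligned text files; the file names used here are hypothetical, so check the "./data" directory for the actual layout.

```python
# Minimal sketch for loading a parallel corpus stored as two aligned,
# line-by-line text files. The file names below are hypothetical; check
# the toolkit's ./data directory for the actual layout.
import itertools

def load_parallel(src_path, tgt_path):
    """Return a list of (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

pairs = load_parallel("data/train.en", "data/train.zh")
print(len(pairs))
for en, zh in itertools.islice(pairs, 3):
    print(en, "|||", zh)
```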
Sina Weibo Parallel Corpus

Sina Weibo does not allow its textual content to be republished (clause 5.1). So, we will only publish the IDs of the messages and provide tools to crawl the data from the server. Note that our crawler was built to prioritize crawling English-Chinese sentence pairs, which is why the English-Chinese corpus is so much larger than those of the other language pairs.

Data Format - Each corpus folder contains the following structure:
To build the dataset you must use the data in meta/meta.csv or meta/meta.json to extract the actual posts from the provider (Sina Weibo or Twitter). Each line of the metadata files contains a message ID, used to find the post, and the indexes of the parallel segments within that post. Using this information it is possible to retrieve the post from Sina Weibo and extract the parallel segments, as sketched below. You will need to build a crawler to retrieve the original message given the ID; check the Parallel Data Extraction section for pointers. Alternatively, if you do not wish to crawl, you can use the small dataset provided in the Machine Translation section instead.

How to get it - You can download the corpus below:
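The sketch below shows the reconstruction step. The metadata column layout it assumes (ID followed by the character offsets of the two segments) is hypothetical, so inspect meta/meta.csv for the real format. It uses Sina Weibo's `statuses/show` API endpoint, which requires an OAuth2 access token.

```python
# Sketch: rebuild parallel segments from metadata. The CSV column layout
# (id, src_start, src_end, tgt_start, tgt_end) is an assumption for
# illustration; inspect meta/meta.csv for the real format.
import csv
import requests

ACCESS_TOKEN = "..."  # hypothetical: a Sina Weibo OAuth2 access token
SHOW_API = "https://api.weibo.com/2/statuses/show.json"

def fetch_weibo_text(post_id):
    """Retrieve the text of a single Weibo post by ID."""
    resp = requests.get(SHOW_API,
                        params={"access_token": ACCESS_TOKEN, "id": post_id})
    return resp.json()["text"]

with open("meta/meta.csv", encoding="utf-8") as f:
    for post_id, s0, s1, t0, t1 in csv.reader(f):
        text = fetch_weibo_text(post_id)
        source = text[int(s0):int(s1)]   # e.g. the English segment
        target = text[int(t0):int(t1)]   # e.g. the Mandarin segment
        print(source, "|||", target)
```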
Twitter Parallel Corpus

Twitter only allows the IDs of the tweets to be published (clause 4.a). Thus, we shall only publish the IDs of the messages, together with metadata that describes how to obtain the parallel data.

Data Format - Each corpus folder contains the following structure:
To build the dataset you must use the data in meta/meta.csv or meta/meta.json to extract the actual posts from the provider. As above, each line of the metadata files contains a tweet ID, used to find the tweet, and the indexes of the parallel segments within that tweet. Using this information it is possible to retrieve the tweet from Twitter and extract the parallel segments. You will need to build a crawler to retrieve the original message given the ID; check the Parallel Data Extraction section for pointers, or see the batch-lookup sketch below. Alternatively, if you do not wish to crawl, you can use the small dataset provided in the Machine Translation section instead.

How to get it - You can download the corpus below:
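For Twitter, tweets can be re-hydrated in batches of up to 100 IDs with the v1.1 `statuses/lookup` endpoint, which is faster than fetching them one by one. The sketch below assumes an application bearer token and that the IDs have already been read from the metadata, as in the Sina Weibo example above.

```python
# Sketch: hydrate tweets in batches of 100 IDs using Twitter API v1.1
# statuses/lookup. Assumes an application bearer token; deleted tweets
# are silently missing from the response.
import requests

BEARER_TOKEN = "..."  # hypothetical credential
LOOKUP_API = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(tweet_ids):
    """Map tweet ID (string) -> tweet text for all IDs that still exist."""
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    texts = {}
    for i in range(0, len(tweet_ids), 100):  # at most 100 IDs per request
        batch = tweet_ids[i:i + 100]
        resp = requests.get(LOOKUP_API, headers=headers,
                            params={"id": ",".join(batch),
                                    "tweet_mode": "extended"})
        for tweet in resp.json():
            texts[tweet["id_str"]] = tweet["full_text"]
    return texts
```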
Twitter Crowdsourced Gold Standard

This section contains the crowdsourced gold standard corpus described in:

The corpus contains parallel data extracted from Twitter by human annotators, which yields higher-quality parallel segments than automatic methods.

How to get it - You can download the corpus below:
Terms Of Use

We are not aware of any copyright restrictions on the material we are publishing (IDs and metadata). If you use these datasets and tools in your research, please cite our paper (bibtex available here). Also, keep in mind that after crawling the data from Twitter and Sina Weibo, you are subject to their terms and conditions. Twitter's terms can be found here. Sina Weibo's terms can be found here and here. We do not endorse and shall not be held responsible or liable for damages resulting from the improper usage of any content downloaded from this site.

Acknowledgements

I, Wang Ling, would like to give thanks to:
The content on this website was created by Wang Ling. Feel free to email me ideas, feedback or suggestions and I will do my best to accommodate them.