μtopia - Microblog Translated Posts Parallel Corpus
Release V1.1 - 19/09/2013

μtopia is a parallel corpus containing parallel sentence pairs mined from Microblogs. For more information, refer to our work:
Some quick references:
Also, if you use the resources on this website, we would appreciate it if you support our work by citing our paper. Bibtex available here.

Parallel Data Extraction

We are interested in users that parallel post (i.e., post tweets with translations in multiple languages). Here is an example of Snoop Dogg and Psy posting in parallel on Sina Weibo:

The parallel extraction system is composed of two main components.

Data Crawling

Before extracting a parallel corpus, you must have a crawl of tweets. You can either use pre-crawled tweets or build your own crawler. If you wish to build your own crawler, here are some resources you can refer to:
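Whichever resources you start from, the sketch below illustrates one simple crawling strategy: paging backwards through a user's timeline, which is useful for users known to parallel post. This is a minimal example, not our actual crawler; it assumes a Twitter API v1.1 application bearer token (the `BEARER_TOKEN` placeholder) and the `statuses/user_timeline` endpoint.

```python
# Minimal timeline crawler sketch (not our actual crawler).
# Assumes a Twitter API v1.1 application bearer token.
import requests

API = "https://api.twitter.com/1.1/statuses/user_timeline.json"
BEARER_TOKEN = "..."  # hypothetical: supply your own credential

def crawl_timeline(screen_name, pages=5):
    """Yield tweets from a user's timeline, paging backwards with max_id."""
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    max_id = None
    for _ in range(pages):
        params = {"screen_name": screen_name, "count": 200,
                  "tweet_mode": "extended"}
        if max_id is not None:
            params["max_id"] = max_id
        tweets = requests.get(API, headers=headers, params=params).json()
        if not isinstance(tweets, list) or not tweets:
            break  # error response or end of timeline
        for tweet in tweets:
            yield tweet
        max_id = tweets[-1]["id"] - 1  # continue strictly below the oldest id

for tweet in crawl_timeline("snoopdogg"):
    print(tweet["full_text"])
```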
Alignment and Extraction

Our extraction process is divided into two tasks: detecting whether a tweet contains parallel data (and, if so, its language pair), and aligning the parallel segments within the tweet.
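As a rough illustration of the alignment task, the sketch below brute-forces a single split point in a tweet and scores the two sides with a toy translation lexicon, keeping the best-scoring split. This is a simplification for illustration only; the lexicon and scoring function here are hypothetical, and our actual model is described in the paper.

```python
# Toy segment-alignment sketch: pick the split point that maximizes a
# lexical translation score between the left (English) and right (Mandarin)
# sides of a tweet. Illustrative only; see the paper for the real model.

# Hypothetical miniature translation lexicon of (English word, Mandarin
# character) pairs.
LEXICON = {("I", "我"), ("love", "爱"), ("my", "我"), ("dog", "狗")}

def score(en_tokens, zh_chars):
    """Count lexicon hits between candidate English and Mandarin segments."""
    return sum(1 for e in en_tokens for z in zh_chars if (e, z) in LEXICON)

def align(tweet):
    """Try every split point; return the best (english, mandarin) pair."""
    best = (0, "", "")
    for i in range(1, len(tweet)):
        left, right = tweet[:i], tweet[i:]
        s = score(left.split(), list(right))
        if s > best[0]:
            best = (s, left.strip(), right.strip())
    return best[1:]

print(align("I love my dog 我爱我的狗"))
# -> ('I love my dog', '我爱我的狗')
```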
If you are interested in working on the extraction of parallel data from Microblogs, you can download our gold standard, available here. It contains 4347 manually annotated tweets. Each tweet is annotated with the following information: (1) is there parallel data? (2) if so, what is the language pair of the data? (3) what are the spans of the parallel data? For research purposes you do not necessarily need to build a crawler: your goal would be to find ways to improve parallel data detection on this gold standard, as well as parallel segment alignment. Metrics to evaluate these tasks are described in our paper.

Machine Translation in Social Media

While Machine Translation on Microblogs and Social Media is a hot topic, the work published in this area so far is still very limited. We believe one big factor is the lack of parallel data to train, tune and test MT systems. Yet this is an interesting problem, since text in this domain is radically different from commonly translated domains, such as news or parliament data. Here are some examples of parallel sentences that illustrate frequent phenomena in this domain:
Social Media Machine Translation Toolkit

To promote research in this direction, we present our toolkit SMMTT (Social Media Machine Translation Toolkit). You can check it out from https://github.com/wlin12/SMMTT or download it here. If you are pursuing research on Machine Translation in Social Media or Microblog data, we recommend using this toolkit as a starting point, as it provides a baseline for your experiments.

Data - The toolkit provides 8000 training, 1250 development and 1250 test sentence pairs for the Mandarin-English language pair. These were carefully extracted from the full 3M sentence pairs extracted from Sina Weibo using heuristics (filtering duplicates, removing frequent alignment errors and finding users that frequently post parallel messages). You can find this corpus in the "./data" directory.

Modeling - The toolkit also provides scripts to automatically build a translation system using the training and development sets, and to evaluate the results on the test set, by running the "./scripts/runExperiment.sh" script.

Baseline - Results were computed and presented at the MT Marathon 2013. You can use these as a baseline for your experiments.
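If you want to inspect the provided data programmatically, a minimal sketch follows. It assumes the corpus is stored as two line-aligned text files; the file names used here are hypothetical, so check the "./data" directory for the actual layout.

```python
# Minimal sketch for loading a parallel corpus stored as two aligned,
# line-by-line text files. The file names below are hypothetical; check
# the toolkit's ./data directory for the actual layout.
import itertools

def load_parallel(src_path, tgt_path):
    """Return a list of (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]

pairs = load_parallel("data/train.en", "data/train.zh")
print(len(pairs))
for en, zh in itertools.islice(pairs, 3):
    print(en, "|||", zh)
```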
Sina Weibo Parallel Corpus

Sina Weibo does not allow its textual content to be republished (clause 5.1). So, we will only publish the IDs of the messages and provide tools to crawl the data from the server. Note that our crawler was built to prioritize crawling English-Chinese sentence pairs, which is why the English-Chinese corpus is so much larger than those of the other language pairs.

Data Format - Each corpus folder contains the following structure:
To build the dataset you must use the data in meta/meta.csv or meta/meta.json to extract the actual posts from the provider (Sina Weibo or Twitter). Each line of the metadata files contains a message ID, used to find the post, and the indexes of the parallel segments within that post. Using this information it is possible to retrieve the post from Sina Weibo and extract the parallel segments, as sketched below. You will need to build a crawler to retrieve the original message given the ID; check the Parallel Data Extraction section for pointers. Alternatively, if you do not wish to crawl, you can use the small dataset provided in the Machine Translation section instead.

How to get it - You can download the corpus below:
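The sketch below shows the reconstruction step. The metadata column layout it assumes (ID followed by the character offsets of the two segments) is hypothetical, so inspect meta/meta.csv for the real format. It uses Sina Weibo's `statuses/show` API endpoint, which requires an OAuth2 access token.

```python
# Sketch: rebuild parallel segments from metadata. The CSV column layout
# (id, src_start, src_end, tgt_start, tgt_end) is an assumption for
# illustration; inspect meta/meta.csv for the real format.
import csv
import requests

ACCESS_TOKEN = "..."  # hypothetical: a Sina Weibo OAuth2 access token
SHOW_API = "https://api.weibo.com/2/statuses/show.json"

def fetch_weibo_text(post_id):
    """Retrieve the text of a single Weibo post by ID."""
    resp = requests.get(SHOW_API,
                        params={"access_token": ACCESS_TOKEN, "id": post_id})
    return resp.json()["text"]

with open("meta/meta.csv", encoding="utf-8") as f:
    for post_id, s0, s1, t0, t1 in csv.reader(f):
        text = fetch_weibo_text(post_id)
        source = text[int(s0):int(s1)]   # e.g. the English segment
        target = text[int(t0):int(t1)]   # e.g. the Mandarin segment
        print(source, "|||", target)
```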
Twitter Parallel Corpus

Twitter only allows the IDs of the tweets to be published (clause 4.a). Thus, we shall only publish the IDs of the messages, together with metadata that describes how to obtain the parallel data.

Data Format - Each corpus folder contains the following structure:
To build the dataset you must use the data in meta/meta.csv or meta/meta.json to extract the actual posts from the provider. As above, each line of the metadata files contains a tweet ID, used to find the tweet, and the indexes of the parallel segments within that tweet. Using this information it is possible to retrieve the tweet from Twitter and extract the parallel segments. You will need to build a crawler to retrieve the original message given the ID; check the Parallel Data Extraction section for pointers, or see the batch-lookup sketch below. Alternatively, if you do not wish to crawl, you can use the small dataset provided in the Machine Translation section instead.

How to get it - You can download the corpus below:
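For Twitter, tweets can be re-hydrated in batches of up to 100 IDs with the v1.1 `statuses/lookup` endpoint, which is faster than fetching them one by one. The sketch below assumes an application bearer token and that the IDs have already been read from the metadata, as in the Sina Weibo example above.

```python
# Sketch: hydrate tweets in batches of 100 IDs using Twitter API v1.1
# statuses/lookup. Assumes an application bearer token; deleted tweets
# are silently missing from the response.
import requests

BEARER_TOKEN = "..."  # hypothetical credential
LOOKUP_API = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(tweet_ids):
    """Map tweet ID (string) -> tweet text for all IDs that still exist."""
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    texts = {}
    for i in range(0, len(tweet_ids), 100):  # at most 100 IDs per request
        batch = tweet_ids[i:i + 100]
        resp = requests.get(LOOKUP_API, headers=headers,
                            params={"id": ",".join(batch),
                                    "tweet_mode": "extended"})
        for tweet in resp.json():
            texts[tweet["id_str"]] = tweet["full_text"]
    return texts
```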
Twitter Crowdsourced Gold Standard

This section contains the crowdsourced gold standard corpus described in:

The corpus contains parallel data extracted from Twitter by human annotators, which yields higher-quality parallel segments than automatic methods.

How to get it - You can download the corpus below:
Terms Of Use

We are not aware of any copyright restrictions on the material we are publishing (IDs and metadata). If you use these datasets and tools in your research, please cite our paper (bibtex available here). Also, keep in mind that after crawling the data from Twitter and Sina Weibo, you are subject to their terms and conditions. Twitter's terms can be found here. Sina Weibo's terms can be found here and here. We do not endorse and shall not be held responsible or liable for damages resulting from the improper usage of any content downloaded from this site.

Acknowledgements

I, Wang Ling, would like to give thanks to:
The content on this website was created by Wang Ling. Feel free to email me ideas, feedback or suggestions and I will do my best to accommodate them.