Word embeddings are ubiquitous in modern NLP, from static ones (like word2vec or fastText) to contextual representations obtained from ELMo, BERT, and other models. One very interesting property of learned word embeddings is that, since they (ideally) model the syntactic or semantic relations between words, the embedding spaces that we learn for two different languages will have somewhat similar properties and structure.
In fact, there is a lot of research on how we can leverage these similarities in order to learn a mapping between two monolingual embedding spaces, so that words from both languages end up living in a single, shared space.
By looking at which words across the two languages end up closest to each other in this shared space (D), we can now evaluate our learned mapping: if the closest neighbors are words that are actually translations of each other (i.e. we can find the pair in a lexicon), we did a good job. Otherwise, we probably haven't learned a good mapping. This whole process is called Bilingual Lexicon Induction (BLI) from bilingual embeddings.
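To make the evaluation step concrete, here is a minimal sketch of nearest-neighbor retrieval with precision@1 against a gold lexicon. It assumes both embedding matrices have already been mapped into the shared space; the function name and the plain cosine-similarity retrieval are illustrative choices, not the exact setup of any particular toolkit.

```python
import numpy as np

def precision_at_1(src_emb, tgt_emb, src_words, tgt_words, lexicon):
    """src_emb: (n_src, d), tgt_emb: (n_tgt, d), both already mapped into the
    shared space; lexicon maps a source word to its set of gold translations."""
    # Normalize rows so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    correct, total = 0, 0
    for i, word in enumerate(src_words):
        if word not in lexicon:
            continue
        nearest = tgt_words[np.argmax(src[i] @ tgt.T)]  # closest target word
        correct += int(nearest in lexicon[word])
        total += 1
    return correct / max(total, 1)
```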
Unfortunately, most recent work in Bilingual or Multilingual Lexicon Induction has been severely Anglocentric.
On the evaluation side, most of the evaluation lexicons are between English and other languages, rather than between arbitrary language pairs.
On the modeling/training side, English is quite often chosen as the language/embedding space to which all the others are aligned (which we refer to as the hub).
Here, we will first show how we created more evaluation lexicons for several language pairs, then discuss the importance of evaluating our systems on diverse language pairs, and show that focusing only on English-centric models and evaluation settings can indeed be detrimental to downstream performance.
Let's briefly go over the most popular LI datasets:
As a proof-of-concept, we collect parallel data between Azerbaijani, Belarusian, Galician, and English from several sources. We then align the parallel sentences with fast_align, and extract word translation pairs with some heuristics (based on word frequency and alignment/translation probabilities). This way, we can create dictionaries of 6k, 7k, and 36k word pairs (one per language pair), in which we can be at least somewhat confident.
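For illustration, here is a rough sketch of what such heuristics could look like, assuming the standard fast_align input format ("source ||| target" per line) and output format ("i-j" index pairs per line). The frequency and probability thresholds below are placeholders, not the exact values we used.

```python
from collections import Counter, defaultdict

def extract_pairs(bitext_path, align_path, min_count=10, min_prob=0.5):
    """Count aligned word pairs and keep only frequent, confident ones.
    bitext_path: lines of the form 'source sentence ||| target sentence'
    align_path:  fast_align output, lines of 'i-j' (source-target indices)."""
    pair_counts, src_counts = Counter(), Counter()
    with open(bitext_path) as bf, open(align_path) as af:
        for bitext_line, align_line in zip(bf, af):
            src, tgt = [side.split() for side in bitext_line.split(" ||| ")]
            for link in align_line.split():
                i, j = map(int, link.split("-"))
                pair_counts[(src[i], tgt[j])] += 1
                src_counts[src[i]] += 1
    dictionary = defaultdict(set)
    for (s, t), c in pair_counts.items():
        # keep a pair if it is frequent and p(t | s) is high enough
        if c >= min_count and c / src_counts[s] >= min_prob:
            dictionary[s].add(t)
    return dictionary
```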
The concept of triangulation is very simple. See Figure 1: if the Portuguese word "trabalho" can be translated to English as "job" or "work", and then "job" can be translated to Czech as "praca", "prácu", or "zamestnanie", then it makes sense for these Czech words to also be translations of the Portuguese "trabalho".
So, we employ this idea and triangulate all MUSE dictionaries through English: from any X-English and English-Y dictionary, we create an X-Y one.
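A minimal sketch of this composition step, assuming both input dictionaries are represented as mappings from a word to a set of translations (the exact data structures are our own choice here):

```python
from collections import defaultdict

def triangulate(x_to_en, en_to_y):
    """Compose an X-English and an English-Y dictionary into an X-Y one.
    Both inputs map a word to a set of translations."""
    x_to_y = defaultdict(set)
    for x_word, en_words in x_to_en.items():
        for en_word in en_words:
            # every Y translation of an English pivot becomes a candidate
            x_to_y[x_word] |= en_to_y.get(en_word, set())
    return x_to_y

# e.g. triangulate(pt_to_en, en_to_cs) yields a Portuguese-Czech dictionary
```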
The last step is to filter the produced dictionaries, so that they follow community best practices and common sense when dealing with morphologically rich languages. For example:
| Portuguese | Greek | Pt POS | Gr POS | Decision |
|---|---|---|---|---|
| obra | εργάζομαι | Noun | Verb | reject |
| obra | εργασία | Noun | Noun | keep |
| obra | δουλειά | Noun | Noun | keep |
| trabalho | εργάζομαι | Verb | Verb | keep |
| trabalho | εργασία | Verb | Noun | reject |
| trabalho | δουλειά | Verb | Noun | reject |
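In code, this boils down to a simple POS-agreement check. A minimal sketch, assuming each candidate pair already comes with a POS tag on both sides (e.g. from the original dictionaries or a tagger):

```python
def filter_by_pos(candidates):
    """candidates: (src_word, tgt_word, src_pos, tgt_pos) tuples, as in the
    table above; keep only the pairs whose POS tags agree."""
    return [(src, tgt) for src, tgt, src_pos, tgt_pos in candidates
            if src_pos == tgt_pos]

# filter_by_pos([("obra", "εργασία", "Noun", "Noun"),
#                ("trabalho", "εργασία", "Verb", "Noun")])
# -> [("obra", "εργασία")]
```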
All the dictionaries are available here.
With our diverse lexicons at hand, we can now perform interesting analyses:
1. Evaluate BLI performance between diverse language pairs
We compare several Bilingual LI methods like MUSE [0] and VecMap [5]. The quality between non-English pairs can vary dramatically. For instance, although English-Portuguese LI reaches an accuracy of more than 70% and English-French more than 80%, the best accuracy we got on Ukrainian-Hindi is only 17.9%, on Korean-Russian it is 11.7%, and on Russian-Turkish it is 27.7% (all with VecMap).
2. Compare BLI performance between typologically related languages and distant ones
Unsurprisingly, we confirm the intuition that typologically distant languages are harder to align than related ones. For example, Slovak-Czech accuracy is around 70%, Slovak-Russian is around 45%, but Slovak-Turkish accuracy drops to 31%.
An alternative to Bilingual Lexicon Induction is learning multilingual word embeddings (MWE), where one jointly aligns multiple embedding spaces. The main approaches are conceptually similar: they use one of the spaces as a hub that remains invariant (you guessed it, the default is English), and align the rest of the spaces to it, typically enforcing some sort of agreement between all embedding spaces. For our analysis we use the state-of-the-art MAT+MPSR method of Chen and Cardie [6].
1. Does the hub language matter?
Yes! In the figure to the right, taken from an experiment where we aligned embedding spaces from 10 languages, we show that the accuracy on the Slovak-Galician dictionary can vary between as low as 20.6% and as high as 28.4%, a statistically significant difference (with a standard deviation of more than 2 percentage points across all possible hub languages).
2. Is English the best hub language?
No! In our two experiments, aligning 10 and 8 languages respectively, we found that English is the best hub language for MWE in less than 20% of the cases. To be fair, English is also not the worst choice. Very low-resource languages like Galician or Belarusian tend to perform much worse overall as a hub, probably due to the lower quality of their pre-trained embeddings.
3. How much can we gain from picking the best hub language?
In the histograms to the right, we show the expected gain from picking each language as the hub in our two MWE experiments, both averaged over all settings and restricted to the cases where that language is indeed the best hub choice.
4. Can we somehow pick the best hub language before running the experiments?
Mmmmmmaybe. A finding we are confident in is that if you care about a specific source-target language pair out of the multiple languages that you are aligning, then you should use neither the source nor the target as the hub, but some other language.
There also seems to be a positive correlation between LI performance and distance measures between the source-hub and target-hub language pairs. We used the typological distance from the URIEL database [7], and found Pearson correlation coefficients of 0.49 and 0.38 for our two MWE experiments.
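As a sketch of how such a correlation can be computed, assuming the URIEL distances have already been looked up for every (source, target, hub) setting; all the numbers below are made-up placeholders for illustration only, not our results:

```python
from scipy.stats import pearsonr

# BLI accuracy for a handful of (source, target, hub) settings, and the
# corresponding typological distances between source-hub and target-hub.
accuracy     = [17.9, 27.7, 45.0, 70.0]   # placeholder values
src_hub_dist = [0.61, 0.48, 0.26, 0.17]   # placeholder values
tgt_hub_dist = [0.54, 0.33, 0.21, 0.35]   # placeholder values

# one simple way to aggregate the two distances into a single predictor
combined = [(a + b) / 2 for a, b in zip(src_hub_dist, tgt_hub_dist)]

r, p = pearsonr(accuracy, combined)
print(f"Pearson's r = {r:.2f} (p = {p:.3f})")
```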
We've seen so far that Lexicon Induction performance can vary based on the hub language choice. However, differences in LI performance do not necessarily translate to differences in the downstream tasks that use the aligned embeddings.
So, we performed a small experiment where we trained POS taggers in English and Portuguese, and then performed zero-shot tagging on other languages (Spanish and Galician) using the jointly learned embeddings from MWE.
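A heavily simplified sketch of this zero-shot setup: this is not our actual tagger; a plain per-token classifier over the aligned embeddings stands in for it, and out-of-vocabulary words are ignored for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tagger(sentences, tags, emb):
    """Train a per-token POS classifier over aligned source-language
    embeddings; `emb` maps a word to its (already aligned) vector."""
    X = np.stack([emb[w] for sent in sentences for w in sent])
    y = [t for sent_tags in tags for t in sent_tags]
    return LogisticRegression(max_iter=1000).fit(X, y)

def zero_shot_accuracy(tagger, sentences, tags, emb):
    """Evaluate the source-trained tagger on target-language tokens, relying
    only on both languages living in the same embedding space."""
    X = np.stack([emb[w] for sent in sentences for w in sent])
    y = [t for sent_tags in tags for t in sent_tags]
    return tagger.score(X, y)
```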
The differences in performance (evaluated by POS tagging accuracy) are stark! Here are the results for transferring from Portuguese to Galician, with each of these languages as the hub:
The main takeaway is that we should be thinking more about language diversity and the way that we construct our experiments. We should try to evaluate on as many languages as we can, and make that set of languages as diverse as possible. To this end, and in the hope that we'll steer the community towards more challenging evaluation scenarios, we provide 4900 training and evaluation dictionaries (you can find them here).[8]
Regarding Multilingual Embeddings, we should consider the choice of the hub language as another hyper-parameter to be optimized, as it can have a large impact on the final results, both for lexicon induction and other downstream tasks.
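A minimal sketch of what treating the hub as a hyper-parameter could look like; `train_mwe` and `evaluate_bli` are hypothetical callables standing in for a multilingual aligner (e.g. MAT+MPSR) and a BLI evaluation routine, and selection is done on held-out dev lexicons.

```python
def select_hub(embedding_spaces, candidate_hubs, dev_lexicons,
               train_mwe, evaluate_bli):
    """Align once per candidate hub and keep the hub with the best average
    accuracy on the dev lexicons. dev_lexicons maps a (src, tgt) pair to a
    gold lexicon; train_mwe and evaluate_bli are supplied by the caller."""
    best_hub, best_score = None, float("-inf")
    for hub in candidate_hubs:
        aligned = train_mwe(embedding_spaces, hub=hub)
        scores = [evaluate_bli(aligned, src, tgt, lexicon)
                  for (src, tgt), lexicon in dev_lexicons.items()]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_hub, best_score = hub, score
    return best_hub, best_score
```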
Picking the best hub language a priori is not easy, and it is definitely a challenging future research direction! We provide all our experimental results with the paper, so perhaps someone could try to train a model to predict the best hub language, in the same way that Lin et al. [9] train a model to choose the best transfer language.
Another interesting research direction would be to focus on minimizing the sensitivity of our machine learning approaches to hyperparameter choices (like the choice of the hub language) and devise techniques that work robustly across the board.