Word embeddings are ubiquitous in modern NLP, from static ones (like word2vec or fastText) to contextual representations obtained from ELMo, BERT, and other models. One very interesting property of learned word embeddings is that, since they (ideally) model the syntactic or semantic relations between words, the embedding spaces that we learn for two different languages will have somewhat similar properties and structure.
In fact, there is a lot of research on how we can leverage these similarities in order to learn a mapping between two monolingual embedding spaces, so that words from both languages end up living in a single, shared space.
By looking at which words across the two languages end up closest to each other in this shared space (D), we can now evaluate our learned mapping: if the closest neighbors are words that are actually translations of each other (i.e. we can find the pair in a lexicon), we did a good job. Otherwise, we probably haven't learned a good mapping. This whole process is called Bilingual Lexicon Induction (BLI) from bilingual embeddings.
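To make the evaluation step concrete, here is a minimal sketch of nearest-neighbor retrieval with precision@1 against a gold lexicon. It assumes both embedding matrices have already been mapped into the shared space; the function name and the plain cosine-similarity retrieval are illustrative choices, not the exact setup of any particular toolkit.

```python
import numpy as np

def precision_at_1(src_emb, tgt_emb, src_words, tgt_words, lexicon):
    """src_emb: (n_src, d), tgt_emb: (n_tgt, d), both already mapped into the
    shared space; lexicon maps a source word to its set of gold translations."""
    # Normalize rows so that the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    correct, total = 0, 0
    for i, word in enumerate(src_words):
        if word not in lexicon:
            continue
        nearest = tgt_words[np.argmax(src[i] @ tgt.T)]  # closest target word
        correct += int(nearest in lexicon[word])
        total += 1
    return correct / max(total, 1)
```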
Unfortunately, most recent work in Bilingual or Multilingual Lexicon Induction has been severely Anglocentric.
On the evaluation side, most of the evaluation lexicons are between English and other languages, rather than between arbitrary language pairs.
On the modeling/training side, English is quite often chosen as the language/embedding space to which all the others are aligned (which we refer to as the hub).
Here, we will first show how we created more evaluation lexicons for several language pairs, then discuss the importance of evaluating our systems on diverse language pairs, and show that focusing only on English-centric models and evaluation settings can indeed be detrimental to downstream performance.
Let's briefly go over the most popular LI datasets:
As a proof-of-concept, we collect parallel data between Azerbaijani, Belarusian, Galician, and English from several sources. We then align the parallel sentences with fast_align, and extract word translation pairs with some heuristics (based on word frequency and alignment/translation probabilities). This way, we can create dictionaries of 6k, 7k, and 36k word pairs (one per language pair), in which we can be at least somewhat confident.
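For illustration, here is a rough sketch of what such heuristics could look like, assuming the standard fast_align input format ("source ||| target" per line) and output format ("i-j" index pairs per line). The frequency and probability thresholds below are placeholders, not the exact values we used.

```python
from collections import Counter, defaultdict

def extract_pairs(bitext_path, align_path, min_count=10, min_prob=0.5):
    """Count aligned word pairs and keep only frequent, confident ones.
    bitext_path: lines of the form 'source sentence ||| target sentence'
    align_path:  fast_align output, lines of 'i-j' (source-target indices)."""
    pair_counts, src_counts = Counter(), Counter()
    with open(bitext_path) as bf, open(align_path) as af:
        for bitext_line, align_line in zip(bf, af):
            src, tgt = [side.split() for side in bitext_line.split(" ||| ")]
            for link in align_line.split():
                i, j = map(int, link.split("-"))
                pair_counts[(src[i], tgt[j])] += 1
                src_counts[src[i]] += 1
    dictionary = defaultdict(set)
    for (s, t), c in pair_counts.items():
        # keep a pair if it is frequent and p(t | s) is high enough
        if c >= min_count and c / src_counts[s] >= min_prob:
            dictionary[s].add(t)
    return dictionary
```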
The concept of triangulation is very simple. See Figure 1: if the Portuguese word "trabalho" can be translated to English as "job" or "work", and then "job" can be translated to Czech as "praca", "prácu", or "zamestnanie", then it makes sense for these Czech words to also be translations of the Portuguese "trabalho".
So, we employ this idea and triangulate all MUSE dictionaries through English: from any X-English and English-Y dictionary, we create an X-Y one.
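A minimal sketch of this composition step, assuming both input dictionaries are represented as mappings from a word to a set of translations (the exact data structures are our own choice here):

```python
from collections import defaultdict

def triangulate(x_to_en, en_to_y):
    """Compose an X-English and an English-Y dictionary into an X-Y one.
    Both inputs map a word to a set of translations."""
    x_to_y = defaultdict(set)
    for x_word, en_words in x_to_en.items():
        for en_word in en_words:
            # every Y translation of an English pivot becomes a candidate
            x_to_y[x_word] |= en_to_y.get(en_word, set())
    return x_to_y

# e.g. triangulate(pt_to_en, en_to_cs) yields a Portuguese-Czech dictionary
```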
The last step is to filter the produced dictionaries, so that they follow community best practices and common sense when dealing with morphologically rich languages. For example:
| Portuguese | Greek | Pt POS | Gr POS | Decision |
|---|---|---|---|---|
| obra | εργάζομαι | Noun | Verb | reject |
| obra | εργασία | Noun | Noun | keep |
| obra | δουλειά | Noun | Noun | keep |
| trabalho | εργάζομαι | Verb | Verb | keep |
| trabalho | εργασία | Verb | Noun | reject |
| trabalho | δουλειά | Verb | Noun | reject |
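In code, this boils down to a simple POS-agreement check. A minimal sketch, assuming each candidate pair already comes with a POS tag on both sides (e.g. from the original dictionaries or a tagger):

```python
def filter_by_pos(candidates):
    """candidates: (src_word, tgt_word, src_pos, tgt_pos) tuples, as in the
    table above; keep only the pairs whose POS tags agree."""
    return [(src, tgt) for src, tgt, src_pos, tgt_pos in candidates
            if src_pos == tgt_pos]

# filter_by_pos([("obra", "εργασία", "Noun", "Noun"),
#                ("trabalho", "εργασία", "Verb", "Noun")])
# -> [("obra", "εργασία")]
```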
All the dictionaries are available here.
With our diverse lexicons at hand, we can now perform interesting analyses:
1. Evaluate BLI performance between diverse language pairs
We compare several Bilingual LI methods like MUSE [0] and VecMap [5]. The quality between non-English pairs can vary dramatically. For instance, although English-Portuguese LI reaches an accuracy of more than 70% and English-French more than 80%, the best accuracy we got on Ukrainian-Hindi is only 17.9%, on Korean-Russian it is 11.7%, and on Russian-Turkish it is 27.7% (all with VecMap).
2. Compare BLI performance between typologically related languages and distant ones
Unsurprisingly, we confirm the intuition that typologically distant languages are harder to align than related ones. For example, Slovak-Czech accuracy is around 70%, Slovak-Russian is around 45%, but Slovak-Turkish accuracy drops to 31%.
An alternative to Bilingual Lexicon Induction is learning multilingual word embeddings (MWE), where one jointly aligns multiple embedding spaces. The main approaches are conceptually similar: they use one of the spaces as a hub that remains invariant (you guessed it, the default is English), and align the rest of the spaces to it, typically enforcing some sort of agreement between all embedding spaces. For our analysis we use the state-of-the-art MAT+MPSR method of Chen and Cardie [6].
1. Does the hub language matter?
Yes! In the figure to the right, taken from an experiment where we aligned embedding spaces from 10 languages, we show that the accuracy on the Slovak-Galician dictionary can vary between as low as 20.6% and as high as 28.4%, a statistically significant difference (with a standard deviation of more than 2 percentage points across all possible hub languages).
2. Is English the best hub language?
No! In our two experiments, aligning 10 and 8 languages respectively, we found that English is the best hub language for MWE in less than 20% of the cases. To be fair, English is also not the worst choice. Very low-resource languages like Galician or Belarusian tend to perform much worse overall as a hub, probably due to the lower quality of their pre-trained embeddings.
3. How much can we gain from picking the best hub language?
In the histograms to the right, we show the expected gain from picking each language as the hub in our two MWE experiments, both averaged over all settings and restricted to the cases where that language is indeed the best hub choice.
4. Can we somehow pick the best hub language before running the experiments?
Mmmmmmaybe. A finding we are confident in is that if you care about a specific source-target language pair out of the multiple languages that you are aligning, then you should use neither the source nor the target as the hub, but some other language.
There also seems to be a positive correlation between LI performance and distance measures between the source-hub and target-hub language pairs. We used the typological distance from the URIEL database [7], and found Pearson correlation coefficients of 0.49 and 0.38 for our two MWE experiments.
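As a sketch of how such a correlation can be computed, assuming the URIEL distances have already been looked up for every (source, target, hub) setting; all the numbers below are made-up placeholders for illustration only, not our results:

```python
from scipy.stats import pearsonr

# BLI accuracy for a handful of (source, target, hub) settings, and the
# corresponding typological distances between source-hub and target-hub.
accuracy     = [17.9, 27.7, 45.0, 70.0]   # placeholder values
src_hub_dist = [0.61, 0.48, 0.26, 0.17]   # placeholder values
tgt_hub_dist = [0.54, 0.33, 0.21, 0.35]   # placeholder values

# one simple way to aggregate the two distances into a single predictor
combined = [(a + b) / 2 for a, b in zip(src_hub_dist, tgt_hub_dist)]

r, p = pearsonr(accuracy, combined)
print(f"Pearson's r = {r:.2f} (p = {p:.3f})")
```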
We've seen so far that Lexicon Induction performance can vary based on the hub language choice. However, differences in LI performance do not necessarily translate to differences in the downstream tasks that use the aligned embeddings.
So, we performed a small experiment where we trained POS taggers in English and Portuguese, and then performed zero-shot tagging on other languages (Spanish and Galician) using the jointly learned embeddings from MWE.
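A heavily simplified sketch of this zero-shot setup: this is not our actual tagger; a plain per-token classifier over the aligned embeddings stands in for it, and out-of-vocabulary words are ignored for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tagger(sentences, tags, emb):
    """Train a per-token POS classifier over aligned source-language
    embeddings; `emb` maps a word to its (already aligned) vector."""
    X = np.stack([emb[w] for sent in sentences for w in sent])
    y = [t for sent_tags in tags for t in sent_tags]
    return LogisticRegression(max_iter=1000).fit(X, y)

def zero_shot_accuracy(tagger, sentences, tags, emb):
    """Evaluate the source-trained tagger on target-language tokens, relying
    only on both languages living in the same embedding space."""
    X = np.stack([emb[w] for sent in sentences for w in sent])
    y = [t for sent_tags in tags for t in sent_tags]
    return tagger.score(X, y)
```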
The differences in performance (evaluated by POS tagging accuracy) are stark! Here are the results for transferring from Portuguese to Galician, with each of these languages as the hub:
The main takeaway is that we should be thinking more about language diversity and the way that we construct our experiments. We should try to evaluate on as many languages as we can, and make that set of languages as diverse as possible. To this end, and in the hope that we'll steer the community towards more challenging evaluation scenarios, we provide 4900 training and evaluation dictionaries (you can find them here).[8]
Regarding Multilingual Embeddings, we should consider the choice of the hub language as another hyper-parameter to be optimized, as it can have a large impact on the final results, both for lexicon induction and other downstream tasks.
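A minimal sketch of what treating the hub as a hyper-parameter could look like; `train_mwe` and `evaluate_bli` are hypothetical callables standing in for a multilingual aligner (e.g. MAT+MPSR) and a BLI evaluation routine, and selection is done on held-out dev lexicons.

```python
def select_hub(embedding_spaces, candidate_hubs, dev_lexicons,
               train_mwe, evaluate_bli):
    """Align once per candidate hub and keep the hub with the best average
    accuracy on the dev lexicons. dev_lexicons maps a (src, tgt) pair to a
    gold lexicon; train_mwe and evaluate_bli are supplied by the caller."""
    best_hub, best_score = None, float("-inf")
    for hub in candidate_hubs:
        aligned = train_mwe(embedding_spaces, hub=hub)
        scores = [evaluate_bli(aligned, src, tgt, lexicon)
                  for (src, tgt), lexicon in dev_lexicons.items()]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_hub, best_score = hub, score
    return best_hub, best_score
```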
Picking the best hub language a priori is not easy, and it is definitely a challenging future research direction! We provide all our experimental results with the paper, so perhaps someone could try to train a model to predict the best hub language, in the same way that Lin et al. [9] train a model to choose the best transfer language.
Another interesting research direction would be to focus on minimizing the sensitivity of our machine learning approaches to hyperparameter choices (like the choice of the hub language) and devise techniques that work robustly across the board.