Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

These notes are the second installment in my retrospective of multilingual sentence encoders. You can find the previous one here.

The notes are on the paper Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond by Mikel Artetxe and Holger Schwenk.

Preface
🔲 The overall problem is the same across this series: we want a sentence encoder which works well across tasks, languages and domains.

Synopsis of the previous episode
โฎ๏ธ We have a sentence encoder which works well in several languages and several domains but it was only tested on the 16 languages it was trained on.

Problem
โ” Can we train a sentence encoder which is general with respect to input language, script and NLP task we use it for?

What is the solution proposed?
Model
🔬 The model is a sequence-to-sequence model trained for translation.
🔬 A strong sentence representation is obtained by passing only the overall sentence encoding to the decoder (unlike the usual translation pipeline, where the decoder has access to the encoding of each token in the sentence).
🔬 Each input sequence is jointly translated into only two target languages (en, es) rather than all 93, for two reasons: no need for a 93-way parallel corpus, and no need to train 93 decoders.
🔬 The architecture is a BiLSTM encoder over BPE embeddings (see the sketch after this list).
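A minimal PyTorch sketch of this kind of architecture (illustrative, not the authors' code; layer counts and dimensions are assumptions): a BiLSTM over BPE embeddings is max-pooled into one fixed-size sentence vector, and the decoder conditions only on that vector rather than attending to per-token encoder states.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # shared BPE vocabulary
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):                                   # tokens: (batch, seq_len)
        states, _ = self.bilstm(self.embed(tokens))              # (batch, seq_len, 2*hidden)
        return states.max(dim=1).values                          # max-pool -> sentence embedding

class Decoder(nn.Module):
    def __init__(self, vocab_size, sent_dim=1024, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The sentence embedding is fed at every decoder step; the decoder never
        # sees per-token encoder states, which forces the single vector to carry
        # all of the sentence's content.
        self.lstm = nn.LSTM(emb_dim + sent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sent_emb, prev_tokens):                    # prev_tokens: (batch, tgt_len)
        emb = self.embed(prev_tokens)
        sent = sent_emb.unsqueeze(1).expand(-1, emb.size(1), -1) # repeat per time step
        hidden, _ = self.lstm(torch.cat([emb, sent], dim=-1))
        return self.out(hidden)                                  # logits over target BPE vocab
```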

Data
💽 A set of 223 million parallel sentences -- a combination of United Nations, OpenSubtitles and other corpora.
💽 The 93 languages come from more than 30 language families and are written in 28 different scripts.

Results
โญ Strong performance on XNLI (14 languages within the 93), MLDOC (document classification, 7 languages), BUCC (finding parallel sentences in large corpora);
โญ NEW dataset: Similarity search corpus in 112 languages; obtained to assess the model's performance on out-of-training-dataset languages.
โญ The dataset shows that the model generalizes well to out of training data languages (zero shot cross-lingual performance).
โญ In an ablation study they prove the effectiveness of joint training.

And?
💭 We have a strong language-agnostic sentence encoder; the model can potentially encode sentences in any language. However, it depends on language-specific preprocessing (a sketch of how such an encoder is used for zero-shot transfer follows below).
💭 A new dataset for sentence similarity search in 112 languages is proposed (Tatoeba).
💭 Joint training works well for cross-lingual performance.
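A hedged sketch of the zero-shot cross-lingual transfer setup that such an encoder enables (as on XNLI): fit a classifier on English sentence embeddings only, then apply it unchanged to embeddings of another language. `embed(texts, lang)` is a placeholder for any LASER-like encoder returning an (n, d) array, and the classifier choice is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def zero_shot_transfer(embed, en_texts, en_labels, xx_texts, xx_labels, xx_lang):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(en_texts, "en"), en_labels)               # train on English only
    return clf.score(embed(xx_texts, xx_lang), xx_labels)   # evaluate on the other language
```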