
After reading the paper by Bojanowski et al. (2016), I went to consult the pre-trained word vectors available on the fastText website.

Here is my concrete doubt:

Are these pre-trained word vectors (https://fasttext.cc/docs/en/pretrained-vectors.html) monolingual? Analogously, can you confirm that these pre-trained word vectors (https://fasttext.cc/docs/en/crawl-vectors.html) are multilingual?

I apologize if this has already been clarified somewhere, but I was unable to verify with 100% certainty.

Thanks in advance.

Reference: P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

mljistcart
    The web pages make this perfectly clear, don't they? In each case, there is a set of vectors, each of which is monolingual. – Tim Roberts Apr 07 '21 at 23:35
  • @TimRoberts I don't think it is perfectly clear. For example, in the first link they show the following note: "...note that a newer version of multi-lingual word vectors are available at: Word vectors for 157 languages." Does that mean that the current word vectors are multi-lingual too? – mljistcart Apr 07 '21 at 23:38

1 Answer


The page at: https://fasttext.cc/docs/en/pretrained-vectors.html

  • provides 294 different language-labeled sets of vectors, each labeled with only a single language
  • describes the models as having been trained by "using the skip-gram model described in Bojanowski et al. (2016) with default parameters" - a paper which does not describe the creation of multilingual vectors

Thus it's safe to assume none of them are explicitly multilingual. (If one or more were, wouldn't they be clearly labeled that way?)
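For context, the subword approach in that paper represents each word by its character n-grams (plus the word itself), which also hints at why each vector set is tightly bound to a single language's orthography. A minimal sketch of just the n-gram decomposition step, using the boundary markers and default lengths (3 to 6) described in the paper:

```python
# Character n-grams as in Bojanowski et al. (2016): each word is wrapped
# in boundary markers '<' and '>' and decomposed into character n-grams
# (default n = 3..6). In the full model, a word's vector is the sum of
# its n-gram vectors; this sketch shows only the decomposition.

def char_ngrams(word, nmin=3, nmax=6):
    """Return the character n-grams of a boundary-wrapped word."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# The paper's own example word, restricted to trigrams:
print(char_ngrams("where", nmin=3, nmax=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Since the n-gram inventory is learned from the training text, subwords that are frequent and meaningful in one language's orthography may be rare or misleading in another.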

Similarly, the page at: https://fasttext.cc/docs/en/crawl-vectors.html

  • does not include the word 'multilingual' anywhere in the page text
  • provides 158 different language-labeled sets of vectors, each labeled with only a single language

Thus I also think it's safe to assume none of them are explicitly multilingual. (If you thought one or more of them were, try downloading them and checking whether they give good results across the multiple languages you suspect, in the absence of any such description, they might cover.)
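One quick sanity check along these lines: the downloads use the plain-text `.vec` format (a header line with the vocabulary size and dimension, then one token per line), so you can at least probe whether a given set contains surface tokens from another language. A minimal sketch with a toy stand-in file; the file name and probe words are illustrative only, and for the real multi-gigabyte downloads a library loader such as gensim's `KeyedVectors.load_word2vec_format` would be the practical choice:

```python
# Probe a fastText .vec file for surface tokens from several languages.
# The .vec text format: a header line "vocab_size dim", then one line
# per token: "token v1 v2 ... vdim". The file and probe words below are
# toy stand-ins, not real fastText downloads.

def load_vec(path):
    """Parse a word2vec/fastText text-format .vec file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Stand-in for a real download such as wiki.en.vec (toy 3-d vectors).
with open("toy.vec", "w", encoding="utf-8") as f:
    f.write("3 3\n")
    f.write("hello 0.1 0.2 0.3\n")
    f.write("world 0.4 0.5 0.6\n")
    f.write("bonjour 0.0 0.1 0.0\n")

vecs = load_vec("toy.vec")
for probe in ["hello", "bonjour", "hallo"]:
    print(probe, probe in vecs)
```

Finding a token present only shows the training text contained it, of course, not that its vector is a good representation of its meaning in the other language.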

I believe the quote you've highlighted, "…a newer version of multi-lingual word vectors are available at…", is using 'multi-lingual word vectors' loosely as 'multiple language word vectors', and describing the total contents of the page, not any single download.

Note that there is later work which aligns alternate-language sets of word vectors, such that the same(ish) meanings have similar coordinates:

https://fasttext.cc/docs/en/aligned-vectors.html

However, even there, each language's vectors are provided as a single download.
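As a sketch of what 'aligned' means in practice: each language still ships as its own monolingual file, but the coordinate spaces are compatible, so a translation pair should score a high cosine similarity across the two sets. The vectors and word pair below are made up for illustration; real usage would load two separate downloads (e.g. the English and French aligned files) into two lookup tables:

```python
import math

# Toy stand-ins for two separately downloaded, aligned vector sets.
# Real aligned downloads are each monolingual files that share a
# coordinate space; these 3-d vectors are invented for illustration.
en_vectors = {"cat": [0.9, 0.1, 0.2]}
fr_vectors = {"chat": [0.88, 0.12, 0.19],   # near "cat": aligned spaces
              "chien": [0.1, 0.9, 0.3]}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(en_vectors["cat"], fr_vectors["chat"]))   # high (close to 1)
print(cosine(en_vectors["cat"], fr_vectors["chien"]))  # lower
```

Note that even here, analyzing a new text still requires a deduced or assumed language in order to pick which lookup table to consult.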

There are so many colliding tokens, and colliding subwords, that mean very different things across different languages that it would be hard to provide a usable single model for multiple languages based on individual word-tokens alone (without the full context that provides extra hints about the author's intended language).

gojomo
  • Thanks for your answer. For the first link, you said that it would be safe to assume none of them are explicitly multilingual. So would it be safe to assume all of them are monolingual? Please check also this link (https://github.com/facebookresearch/MUSE#get-monolingual-word-embeddings), where they mention them as being monolingual. – mljistcart Apr 08 '21 at 09:45
  • It indicates each set was trained to cover a certain single target language. (In the case of the Wikipedia-text trained vectors, from the Wikipedia articles that claim to be in that language.) But given the way many languages are used together in real texts, each probably has some words from other languages. That's why I'd say none are "explicitly multilingual". But calling them strictly 'monolingual' might be too much - you'd have to run experiments. What is your goal, in making the distinction? – gojomo Apr 08 '21 at 17:51
  • My goal is to run experiments using both monolingual and multilingual and then compare the results. Moreover, I also have the purpose of experimenting with the cross-lingual transfer of word embeddings, which implies understanding whether they are multilingual or monolingual. – mljistcart Apr 08 '21 at 21:26
  • What's an example of "multilingual" vectors in your case? (Even the FastText cross-lingual 'aligned' vectors are in separate files, but compatible coordinate-spaces, & thus using them to analyze new texts requires a deduced/assumed language, to pick which set to use.) Ultimately, it may be an oversimplification to describe any set of vectors as conclusively 'monolingual' or 'multilingual'. Instead, we just know what they were trained on - which has typically been texts categorized as a certain language, but which practically often includes a smattering of other-language words as well. – gojomo Apr 09 '21 at 01:45