
I have a question about fasttext (https://fasttext.cc/). I want to download a pre-trained model and use it to retrieve the word vectors from text.

After downloading the pre-trained model (https://fasttext.cc/docs/en/english-vectors.html) I unzipped it and got a .vec file. How do I import this into fasttext?

I've tried to use the mentioned function as follows:

import fasttext
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

vectors = load_vectors('/Users/username/Downloads/wiki-news-300d-1M.vec')
model = fasttext.load_model(vectors)

However, I can't run this code to completion because Python crashes. How can I successfully load these pre-trained word vectors?

Thank you for your help.

Hansmagnetron

2 Answers


fastText's advantage over word2vec or GloVe, for example, is that it uses subword information to return vectors for out-of-vocabulary (OOV) words.
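To make "subword information" concrete, here is a minimal sketch of splitting a word into character n-grams with boundary markers. The function name char_ngrams and the default range of 3 to 6 characters are assumptions chosen for illustration; this is not the library's own code, which among other things also hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    Illustrative helper only: the name and the default n-gram range
    are assumptions, not part of the fasttext Python API.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A word that never appeared in training still shares many of these pieces with known words, which is what lets a subword model assemble a vector for it.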

That is why the pretrained models come in two file types: .vec and .bin.

.vec is a plain-text file that is effectively a dictionary Dict[word, vector]: the word vectors are pre-computed for the words in the training vocabulary only.
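As a rough picture of that layout, the sketch below parses a tiny made-up sample in the same shape: a header line with the word count and vector size, then one word per line followed by its numbers. The sample text and the name parse_vec are invented for illustration.

```python
import io

# A made-up three-dimensional sample in the .vec layout (not real fastText data).
sample = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"


def parse_vec(stream):
    """Parse a .vec-style text stream into (word_count, dimension, vectors)."""
    word_count, dimension = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        tokens = line.rstrip().split(' ')
        # First token is the word, the rest are the vector components.
        vectors[tokens[0]] = [float(value) for value in tokens[1:]]
    return word_count, dimension, vectors


n, d, vectors = parse_vec(io.StringIO(sample))
print(n, d, vectors["hello"])
# 2 3 [0.1, 0.2, 0.3]
```

Looking up a word that is not in the file raises KeyError, which is the practical meaning of "training vocabulary only".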

.bin is a binary fastText model that can be loaded with fasttext.load_model('file.bin'). It can provide word vectors for unseen (OOV) words, be trained further, etc.

In your case you are loading a .vec file, so the vectors dictionary is already the "final form" of the data. fasttext.load_model expects the path to a .bin file, not a dictionary.
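If you do have the .bin file, a minimal sketch of the intended usage could look like this. The helper names load_bin_model and vectors_for are made up here; the only library calls used are fasttext.load_model and the model's get_word_vector method.

```python
def load_bin_model(path):
    """Load a fastText .bin model from `path` (a local file path you supply)."""
    # Imported lazily so the helpers can be defined without the package installed.
    import fasttext
    return fasttext.load_model(path)


def vectors_for(model, words):
    """Return {word: vector} for each word, using the model's get_word_vector."""
    return {word: model.get_word_vector(word) for word in words}
```

Assuming a downloaded model at 'file.bin', usage would be model = load_bin_model('file.bin') followed by vectors_for(model, ['hello', 'helllo']); the misspelled word should still get a vector because it is built from subwords.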

If you need more than a Python dictionary, you can use gensim.models.KeyedVectors (which handles any word vectors, such as word2vec, GloVe, etc.).
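A minimal sketch of the gensim route, assuming gensim is installed. The wrapper name load_vec_with_gensim is invented for this example; KeyedVectors.load_word2vec_format is the gensim call that reads this text layout.

```python
def load_vec_with_gensim(path):
    """Load a text .vec file at `path` into a gensim KeyedVectors object."""
    # Imported lazily so the helper can be defined without gensim installed.
    from gensim.models import KeyedVectors
    # binary=False because .vec is the plain-text word2vec format.
    return KeyedVectors.load_word2vec_format(path, binary=False)
```

With the file from the question, kv = load_vec_with_gensim('/Users/username/Downloads/wiki-news-300d-1M.vec') would give an object that supports kv['hello'] for lookup and kv.most_similar('hello') for nearest neighbours. Like the plain dictionary, it only knows words that are in the file.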

ygorg

I use the following code to load the .vec file in Python 3, where PATH_TO_FASTTEXT is the path to the .vec file.

Most notably, the result of map needs to be explicitly converted to a list, because in Python 3 map returns a lazy iterator rather than a list.


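import io
from typing import Dict, List

# Placeholder: set this to the location of your unzipped .vec file.
PATH_TO_FASTTEXT = 'wiki-news-300d-1M.vec'

def LoadFastText() -> Dict[str, List[float]]:
    word_to_vector: Dict[str, List[float]] = dict()
    with io.open(PATH_TO_FASTTEXT, 'r', encoding='utf-8', newline='\n', errors='ignore') as input_file:
        # The header line holds the vocabulary size and the vector dimension.
        no_of_words, vector_size = map(int, input_file.readline().split())
        for line in input_file:
            tokens = line.rstrip().split(' ')
            word = tokens[0]
            # map() is lazy in Python 3, so materialize it as a list.
            vector = list(map(float, tokens[1:]))
            assert len(vector) == vector_size
            word_to_vector[word] = vector
    return word_to_vector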
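The point about converting map to a list can be checked with a small experiment. This is only an illustration on a made-up line of text, but it shows why storing the bare map object, as the question's code does, leaves you with values that can be read once and never indexed.

```python
# A made-up line in the .vec layout: a word followed by its vector components.
tokens = "hello 0.1 0.2 0.3".split(' ')

lazy = map(float, tokens[1:])   # an iterator, nothing has been computed yet
first_pass = list(lazy)         # consumes the iterator
second_pass = list(lazy)        # the iterator is now exhausted

eager = list(map(float, tokens[1:]))  # a real list: reusable and indexable

print(first_pass, second_pass, len(eager))
# [0.1, 0.2, 0.3] [] 3
```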
Eric McLachlan
  • How do you build a model out of those vectors then? I tried to use `load_model` for that and passed the vectors in as a parameter, but I get the following error: ```TypeError: loadModel(): incompatible function arguments. The following argument types are supported: 1. (self: fasttext_pybind.fasttext, arg0: str) -> None``` – Deil Jun 08 '23 at 19:56