2

I am doing sentiment analysis and I want to use pre-trained fastText embeddings; however, the file is very large (6.7 GB) and the program takes ages to load it.

import os

import numpy as np
from tqdm import tqdm

fasttext_dir = '/Fasttext'

embeddings_index = {}
with open(os.path.join(fasttext_dir, 'wiki.en.vec'), 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.rstrip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('found %s word vectors' % len(embeddings_index))

embedding_dim = 300

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Is there any way to speedup the process?

BlueMango
  • how long does this code run? – Mikhail Stepanov Jan 19 '19 at 16:20
  • Takes around 8 min. to read and process the file. Even after that the computer remains very laggy for a long time – BlueMango Jan 19 '19 at 16:37
  • It's a normal time for a Python script and this task. But you can speed it up using another language. It probably also consumes a lot of memory. Try to pickle the data and delete (via `del`) what you no longer need (`embeddings_index` once it's processed, etc.) – Mikhail Stepanov Jan 19 '19 at 16:49

3 Answers

6

You can load the pretrained embeddings with gensim instead; at least for me this was much faster. First you need to `pip install gensim`, and then you can load the model with the following code:

from gensim.models import FastText

model = FastText.load_fasttext_format('cc.en.300.bin')

(I'm not sure if you need the .bin file for this; maybe the .vec file also works. Note that in newer gensim versions `load_fasttext_format` is deprecated in favor of `gensim.models.fasttext.load_facebook_model`.)

To get the embedding of a word with this model, simply use model[word].

Anna Krogager
0

I recommend using the .bin models, but if one doesn't exist and you only have a .vec or .txt file, try parallelizing the parsing with joblib:

import os

import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm

def loading(line):
    values = line.rstrip().rsplit(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    return word, coefs

if __name__ == '__main__':
    with open(os.path.join('D:/multi_lingual', 'wiki.en.align.vec'), 'r', encoding='utf-8') as f:
        embeddings_index = dict(Parallel(n_jobs=-1)(delayed(loading)(line) for line in tqdm(f)))
    print(len(embeddings_index))

By monitoring the tqdm progress bar, I noticed the improvement:

without parallelization: 10208.44it/s

with parallelization: 23155.08it/s

I'm using a 4-core CPU. The results are not totally precise because I was using the processor for other things at the same time, so you may see an even bigger improvement.

The other point is: after extracting the required words, I recommend saving them somewhere you can load them from next time, instead of parsing the whole embeddings file on every run.
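For example, once the `embedding_matrix` from the question has been built, it can be written to disk with NumPy and reloaded almost instantly on later runs. A small sketch, with a tiny random matrix standing in for the real `(max_words, 300)` one:

```python
import numpy as np

# Stand-in for the real (max_words, 300) matrix from the question
embedding_matrix = np.random.rand(5, 3).astype('float32')

np.save('embedding_matrix.npy', embedding_matrix)  # one-time cost after parsing
loaded = np.load('embedding_matrix.npy')           # near-instant on later runs
print(np.array_equal(embedding_matrix, loaded))    # True
```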

Minions
0

You can also load it once, save it as a pickle, and then load the pickle. Loading a pickle file is much faster than re-parsing the text file in Python.
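A minimal sketch of that round trip, with a one-entry dictionary standing in for the full `embeddings_index`:

```python
import pickle

import numpy as np

# One-entry stand-in for the full embeddings_index from the question
embeddings_index = {'hello': np.arange(4, dtype='float32')}

# One-time cost: parse the .vec file once, then pickle the result
with open('embeddings_index.pkl', 'wb') as f:
    pickle.dump(embeddings_index, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later runs: load the pickle instead of re-parsing 6.7 GB of text
with open('embeddings_index.pkl', 'rb') as f:
    restored = pickle.load(f)

print(np.array_equal(restored['hello'], embeddings_index['hello']))  # True
```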

Arrabi