
I have a list of words, and I need to create a pairwise similarity matrix using the Fasttext word embedding. This is what I am currently doing:

import numpy as np
from gensim.models import fasttext as ft
from sklearn.metrics import pairwise_distances

path='cc.en.300.bin'
model=ft.load_facebook_vectors(path, encoding='utf-8')

wordlist = [x for x in df_['word']]  # list of words from dataframe

wordlist_vec = [model[x] for x in df_['word']]  #get word vector
wd_arr = np.array(wordlist_vec).reshape(-1, 1)  # reshape to compute pairwise distance

distances = pairwise_distances(wd_arr, wd_arr, metric=model.similarity)  # pairwise distance matrix

This should yield a pairwise distance matrix using Gensim's cosine similarity function. Unfortunately, I get a memory error:

Unable to allocate 1013. GiB for an array with shape (368700, 368700) and data type float64

I guess this is because it's trying to store all the word vectors in memory (we are talking about ~1100 words, tops).

I am not sure which way to proceed here. Is there a native gensim function to create a similarity matrix starting from a list of words? Alternatively, what could be a clever way to get it?

sato

1 Answer


The error clearly indicates that pairwise_distances() has been given 368,700 items whose distances should be calculated with 368,700 other items.

That would take (368700^2) * 8 bytes ≈ 1013 GiB of RAM to calculate, which your machine likely does not have, hence the memory error.
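To make the arithmetic concrete, here is a small sketch that reproduces the figure from the error message and compares it with what a matrix for roughly 1100 words would need. The helper name `matrix_gib` is made up for illustration; the 1100 figure is the asker's own estimate.

```python
def matrix_gib(n_items, bytes_per_value=8):
    """Memory for a dense n_items x n_items float64 matrix, in GiB."""
    return n_items ** 2 * bytes_per_value / 2 ** 30

# Size implied by the error message: 368,700 x 368,700 float64 values
full_gib = matrix_gib(368_700)

# Size expected for the asker's estimate of about 1100 words
expected_gib = matrix_gib(1_100)

print(f"{full_gib:.0f} GiB")                 # matches the 1013 GiB in the error
print(f"{expected_gib * 1024:.1f} MiB")      # a few MiB, trivially small
```

A matrix for about 1100 items fits in a few mebibytes, so the allocation failure points to the input shape rather than to the number of words.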

If you think it should be only "~1100 words, tops", take a look at your interim values – wordlist, wordlist_vec, & wd_arr – to make sure each is the size/shape/contents you intend.
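One plausible source of the 368,700 figure is the `.reshape(-1, 1)` call: if each vector has 300 dimensions (as the `cc.en.300.bin` filename suggests), then 368,700 = 1229 × 300, so the reshape would turn 1229 vectors into 368,700 one-element rows. The sketch below checks this with random stand-in vectors instead of the real model; the word count of 1229 is inferred from that arithmetic, not stated in the question.

```python
import numpy as np

n_words, dim = 1229, 300  # assumed: 1229 is inferred from 368700 / 300
rng = np.random.default_rng(0)

# Stand-in for [model[x] for x in df_['word']]: one 300-d vector per word
wordlist_vec = [rng.standard_normal(dim) for _ in range(n_words)]

stacked = np.array(wordlist_vec)      # one row per word
flattened = stacked.reshape(-1, 1)    # what the question's code does

print(len(wordlist_vec))   # number of words
print(stacked.shape)       # (1229, 300)
print(flattened.shape)     # (368700, 1)
```

Printing `len(...)` and `.shape` for each interim value in this way shows at which step the row count stops matching the number of words.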

(You may run into another issue once you fix that, though: I don't think model.similarity is of the type expected by the metric parameter of pairwise_distances().)
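As far as I know, the mismatch is that model.similarity takes two words, while a callable passed as metric receives two numeric rows. One way around it is to compute cosine similarity directly on the stacked vectors. The following is a minimal NumPy sketch of that idea, not necessarily the approach the asker ended up with; `cosine_similarity_matrix` is a hypothetical helper and the random vectors stand in for real FastText vectors.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the (n, n) cosine similarity matrix for n row vectors."""
    arr = np.asarray(vectors, dtype=np.float64)           # shape (n, dim)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)    # shape (n, 1)
    norms[norms == 0] = 1.0                               # avoid division by zero
    unit = arr / norms                                    # unit-length rows
    return unit @ unit.T                                  # pairwise dot products

# Stand-in for np.array([model[w] for w in wordlist]), kept 2-D (no reshape)
rng = np.random.default_rng(0)
demo_vectors = rng.standard_normal((5, 300))

sim = cosine_similarity_matrix(demo_vectors)
print(sim.shape)  # (5, 5)
```

With real data, the input would be the (n_words, 300) array of word vectors. scikit-learn's `cosine_similarity` in `sklearn.metrics.pairwise`, or `pairwise_distances(..., metric='cosine')` for distances, should give equivalent results on the same 2-D array.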

gojomo
    Yes, you are correct: I finally solved it by adopting an entirely different approach. I also identified a mistake in how I was formatting the array, which led to the 368,700 items. You are also correct on your last point: model.similarity is unfortunately not suitable for the metric parameter. I might edit the question or delete it later, since I guess it's no help to anyone. Thanks a lot for your help. – sato Jun 08 '22 at 10:10