
I need to train some GloVe models to compare their output with word2vec and fastText. The reference implementation is written in C, and I can't read C code. The GitHub repo is here.

The training corpus needs to be formatted into a single text file. For me, this would be >>100G -- far too big to fit in memory. Before I spend time constructing such a file, I'd be grateful if someone could tell me whether the GloVe implementation tries to read the whole thing into memory, or whether it streams it from disk.

If the former, then glove's current implementation wouldn't be compatible with my data (I think). If the latter, I'd have at it.

generic_user

1 Answer


GloVe first constructs a word co-occurrence matrix and then trains on that matrix, not on the raw corpus. While constructing the matrix, the linked implementation streams the input file across several threads, each thread reading one line at a time, so the corpus is never loaded into memory as a whole.
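To make the idea concrete, here is a minimal single-threaded Python sketch of that first pass (not the actual C implementation, which also uses multiple threads and distance-weighted counts): the file is streamed line by line, and only the accumulated co-occurrence counts live in memory.

```python
from collections import Counter

def cooccurrence_counts(path, window=5):
    """Stream a corpus file one line at a time and accumulate symmetric
    word co-occurrence counts within a fixed window.

    Only the counts dictionary is held in memory, never the corpus itself.
    This is an illustrative sketch, not the GloVe C code.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # constant memory with respect to file size
            tokens = line.split()
            for i, word in enumerate(tokens):
                # Look only at context words to the right; record the
                # pair in both directions to keep counts symmetric.
                for ctx in tokens[i + 1 : i + 1 + window]:
                    counts[(word, ctx)] += 1
                    counts[(ctx, word)] += 1
    return counts
```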

The required memory therefore depends mainly on the number of unique words in your corpus (and the number of distinct co-occurring pairs), not on the total file size, as long as individual lines are not excessively long.
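Since memory scales with the vocabulary rather than the file size, you can cheaply estimate the vocabulary before committing to the full pipeline. A small sketch (the tokenization here is plain whitespace splitting, an assumption; GloVe's own vocab_count tool does this job in the real pipeline):

```python
def vocab_size(path):
    """Count unique whitespace-separated tokens by streaming the file.

    Memory use is proportional to the vocabulary, not the file size,
    so this is safe to run on a corpus far larger than RAM.
    """
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.update(line.split())
    return len(vocab)
```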

Boyan Hristov