I'm trying to train GloVe (https://github.com/stanfordnlp/GloVe/blob/master/src/glove.c) on a pretty big dataset: the newest Wikipedia dump (a 22 GB text file). The vocabulary I'm training on is 1.7 million words. Every step before glove (vocab_count, cooccur, shuffle) runs smoothly without any memory error. (My RAM = 64 GB.)
However, when I run glove, I get "Segmentation fault (core dumped)":
aerin@capa:~/Desktop/GloVe/build$ ./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file glove300 -iter 25 -gradsq-file gradsq -verbose 2 -vector-size 300 -threads 1 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2
TRAINING MODEL
Read 1939406304 lines.
Initializing parameters...done.
vector size: 300
vocab size: 1737888
x_max: 100.000000
alpha: 0.750000
Segmentation fault (core dumped)
I tried different numbers of threads as well: 1, 2, 4, 8, 16, 32, etc. Nothing runs. Can someone please point me to where I should look?
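To make the memory question concrete, here is a small standalone C sketch that allocates and touches two arrays of the size I believe glove.c reserves for the parameters and their squared gradients. The 2 * vocab_size * (vector_size + 1) doubles per array is my reading of initialize_parameters, not something I have verified line by line, so treat it as an assumption; the program itself is only a diagnostic, not part of GloVe.

    /* alloc_test.c -- check that buffers of the size glove.c (presumably)
     * needs for W and gradsq can be allocated and backed by physical memory.
     * Assumption: two arrays of 2 * vocab_size * (vector_size + 1) doubles. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const long long vocab_size = 1737888;  /* my vocabulary size */
        const long long vector_size = 300;     /* -vector-size 300 */
        const long long n = 2 * vocab_size * (vector_size + 1); /* doubles per array */

        printf("~%.1f GiB per array, ~%.1f GiB for W + gradsq\n",
               n * sizeof(double) / 1073741824.0,
               2.0 * n * sizeof(double) / 1073741824.0);

        double *W = calloc(n, sizeof(double));
        double *gradsq = calloc(n, sizeof(double));
        if (W == NULL || gradsq == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        /* Touch one double per 4 KiB page so the kernel actually has to back
         * the memory; calloc alone can "succeed" thanks to overcommit. */
        for (long long i = 0; i < n; i += 512) { W[i] = 1.0; gradsq[i] = 1.0; }
        printf("allocated and touched both arrays OK\n");
        free(W);
        free(gradsq);
        return 0;
    }

If something like this (compiled with, say, gcc -O2 -o alloc_test alloc_test.c) fails or gets killed, the parameter arrays alone are already too big for the machine; if it passes, the crash presumably comes from somewhere else in glove.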
Update
I cut the vocabulary from 1.7 million to 1 million words and glove.c runs without the "Segmentation fault" error, so it does look like a memory error. But I would love to learn how to resolve this error and be able to train a model on the larger dataset! Any comment will be highly appreciated. Thanks.
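For what it's worth, the rough arithmetic behind that conclusion, assuming (as above, and it is only an assumption) that the parameters and the squared gradients are each stored as 2 * vocab_size * (vector_size + 1) doubles: at 1.7 million words that is 2 * 1,737,888 * 301 * 8 bytes ≈ 8.4 GB per array, roughly 16.7 GB for the pair; at 1 million words it drops to about 4.8 GB per array, roughly 9.6 GB for the pair, plus whatever else glove allocates per thread.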