
I want to use word2vec to train my own word vectors on the current version of the English Wikipedia, but I can't find an explanation of the command-line parameters for that program. In the demo script you can find the following:
(text8 is an old Wikipedia corpus from 2006)

make
if [ ! -e text8 ]; then
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin

What is the meaning of these command-line parameters:
-cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

And what are the most suitable values when I have a Wikipedia text corpus of around 20 GB (a .txt file)? I read that for bigger corpora a vector size of 300 or 500 would be better.


2 Answers


You can check main() of word2vec.c, where an explanation of each option can be found:

printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
printf("Options:\n");
printf("Parameters for training:\n");
printf("\t-train <file>\n");
printf("\t\tUse text data from <file> to train the model\n");...`

About the most suitable values, I'm sorry I don't know the answer, but you can find some hints in the 'Performance' section of the source site (Word2Vec - Google Code). It says:

 - architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
 - the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
 - sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
 - dimensionality of the word vectors: usually more is better, but not always
 - context (window) size: for skip-gram usually around 10, for CBOW around 5 
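Putting those hints together, here is a sketch of one plausible invocation for a large corpus like your 20 GB dump. Note that wiki.txt is just a placeholder for your corpus file, and the values are only a starting point derived from the hints above (skip-gram, negative sampling, 300-dimensional vectors, window 10, stronger sub-sampling), not tested recommendations:

time ./word2vec -train wiki.txt -output wiki-vectors.bin -cbow 0 -size 300 -window 10 -negative 5 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 5

Keep in mind that skip-gram with a window of 10 is much slower than the CBOW demo command, so on 20 GB you may want to lower -iter or try -cbow 1 first.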
  • Thanks for answering. I didn't see the information in the .c file. There is also some information in the README file, but no good explanation of the values themselves. But I think I can work with that information, thanks again! – Rainflow Jun 09 '15 at 09:21

Parameter meanings:

-train text8: the corpus on which you will train your model

-output vectors.bin: write the learned vectors to this file so you can load and use them later

-cbow 1: use the "continuous bag of words" (CBOW) architecture; -cbow 0 would select skip-gram instead

-size 200: represent each word as a vector of 200 values

-window 8: consider up to 8 words of context around the current word

-negative 25: use negative sampling with 25 negative examples per training sample (0 disables it)

-hs 0: do not use hierarchical softmax

-sample 1e-4: randomly down-sample words that occur more frequently than this threshold

-threads 20: train with 20 parallel threads

-binary 1: save the resulting vectors in binary rather than plain-text format

-iter 15: run 15 training iterations over the corpus

If you are new to word2vec, you can also use its implementation in Python through gensim.
