I want to use word2vec to build my own word vector corpus from the current version of the English Wikipedia, but I can't find an explanation of the command-line parameters for that program. In the demo script you can find the following:
(text8 is an old Wikipedia corpus from 2006)
make
if [ ! -e text8 ]; then
  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
  gzip -d text8.gz -f
fi
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
What is the meaning of these command-line parameters:
vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
And what are the most suitable values when I have a Wikipedia text corpus of around 20 GB (a .txt file)? I have read that for bigger corpora a vector size of 300 or 500 would be better.
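For example, assuming my cleaned Wikipedia dump is in a plain-text file called wiki.txt (that file name and the changed values are only my guesses, I have not tested them), I was thinking of trying something like:

time ./word2vec -train wiki.txt -output wiki-vectors.bin -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance wiki-vectors.bin

Would that be a reasonable starting point, or should some of the other parameters also change for a corpus that large?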