I downloaded the pre-trained English Wikipedia vectors file (wiki.en.vec) from the fastText GitHub repository page, and I tried to compute the syntactic and semantic analogy task accuracies described in the first of Mikolov's word2vec papers, as follows:
I built the word2vec repository by simply running make.
I ran ./compute-accuracy wiki.en.vec 0 < questions-words.txt. That is, I pass the pre-trained vectors file to word2vec's compute-accuracy binary along with a vocabulary threshold of 0, so that the entire vocabulary is considered instead of the default cap of 30000, and I feed in the evaluation dataset questions-words.txt on stdin, since I noticed that the code reads the dataset from there.
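For clarity, this is the exact sequence I ran from the word2vec source directory (I had copied wiki.en.vec and questions-words.txt into that directory; adjust the paths if yours live elsewhere):

    # build word2vec's tools, including compute-accuracy
    make

    # threshold 0 = use the full vocabulary; the analogy questions are read from stdin
    ./compute-accuracy wiki.en.vec 0 < questions-words.txt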
In response, I simply get a bunch of NaNs, as shown below. This doesn't change even if I set the threshold to 30000 or any other value.
    >capital-common-countries:
    ACCURACY TOP1: 0.00 % (0 / 1)
    Total accuracy: -nan % Semantic accuracy: -nan % Syntactic accuracy: -nan %
Can someone please explain why the English pre-trained vectors don't seem to work with word2vec's accuracy computation code? I took a look at compute-accuracy.c, and it does appear to expect the standard vector-file formatting convention; I also took a look at wiki.en.vec, and it does appear to follow that convention.
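In case it matters, this is roughly how I spot-checked the .vec file: just inspecting the header line, which should be "<vocab_size> <dimension>", followed by one word per line with its vector components in plain text (no assumptions here beyond the file being plain text):

    # header line: vocabulary size and vector dimensionality
    head -n 1 wiki.en.vec

    # first few hundred bytes: the header plus the beginning of the first word's vector
    head -c 300 wiki.en.vec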
Also, the fastText paper reports word analogy accuracies for fastText vectors and cites Mikolov's word2vec paper there -- clearly the same dataset was used, and presumably the same word2vec compute-accuracy.c code was used to obtain the reported numbers. So could someone please explain what is going wrong?