
I downloaded the pre-trained English Wikipedia vectors file (wiki.en.vec) from the fastText Github repository page, and I tried to compute the syntactic and semantic analogy task accuracies as described in the first of Mikolov's word2vec papers as follows:

I built the word2vec repository by simply running `make`.

I ran `./compute-accuracy wiki.en.vec 0 < questions-words.txt`. That is, I pass the pre-trained vectors file to word2vec's `compute-accuracy` binary along with a threshold of 0, so that the entire vocabulary is considered instead of the default cap of 30000, and I feed in the evaluation dataset `questions-words.txt` via `<` because I noticed that the code reads the dataset from stdin.
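For context, the evaluation that `compute-accuracy` performs can be sketched roughly as follows -- a toy re-implementation of the analogy test from Mikolov's paper, where for each question "a b c d" the predicted word is the one whose vector is nearest (by cosine) to vec(b) - vec(a) + vec(c), excluding the three question words. This is only an illustrative sketch, not the actual C code:

```python
import numpy as np

def analogy_accuracy(vectors, questions):
    """Toy sketch of the word2vec analogy evaluation.
    `vectors` maps word -> numpy vector; `questions` is a list of
    (a, b, c, expected) tuples. Predicts the word whose unit vector
    is closest (cosine) to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    words = list(vectors)
    mat = np.stack([vectors[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit rows
    correct = 0
    for a, b, c, expected in questions:
        target = vectors[b] - vectors[a] + vectors[c]
        target = target / np.linalg.norm(target)
        sims = mat @ target                  # cosine similarities
        for w in (a, b, c):                  # question words are excluded
            sims[words.index(w)] = -np.inf
        if words[int(np.argmax(sims))] == expected:
            correct += 1
    return correct / len(questions)
```

With toy vectors where vec(woman) - vec(man) + vec(king) lands exactly on vec(queen), the single question ("man", "woman", "king", "queen") is answered correctly.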

In response, I simply get a bunch of NaNs like below. This doesn't change even if I change the threshold value to 30000 or anything else.

```
capital-common-countries:
ACCURACY TOP1: 0.00 %  (0 / 1)
Total accuracy: -nan %  Semantic accuracy: -nan %  Syntactic accuracy: -nan %
```

Can someone please explain why the English pre-trained vectors don't seem to work with word2vec's accuracy computation code? I took a look at `compute-accuracy.c`, and it does appear to expect the standard vector file formatting convention; I also took a look at `wiki.en.vec`, and it does appear to follow that convention.
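To make "standard convention" concrete: the text `.vec` layout I am assuming here is a header line `vocab_size dim`, followed by one line per word containing the token and `dim` space-separated floats. A minimal sanity check under that assumption (the function name is mine, not part of either tool) could look like:

```python
def check_vec_header(lines):
    """Sanity-check the conventional text .vec layout (an assumption,
    not taken from either codebase): a 'vocab_size dim' header line,
    then lines of 'word' plus dim floats. Returns (vocab_size, dim)."""
    it = iter(lines)
    header = next(it).split()
    if len(header) != 2:
        raise ValueError("expected 'vocab_size dim' header")
    vocab_size, dim = int(header[0]), int(header[1])
    first = next(it).rstrip("\n").split(" ")
    if len(first) != dim + 1:
        raise ValueError(f"expected word + {dim} floats, got {len(first)} fields")
    float(first[1])  # vector values should parse as floats
    return vocab_size, dim
```

Running this over the first two lines of `wiki.en.vec` would confirm whether the file at least matches that text convention.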

Also, the fastText paper reports word-analogy accuracies for fastText vectors and cites Mikolov's word2vec paper there -- clearly the same dataset was used, and presumably the same word2vec `compute-accuracy.c` was used to obtain the reported numbers. So could someone please explain what's going wrong?

  • To people down-voting this: please also comment here about what the question lacks, so that I can make the needed improvements. Simply down-voting my question without giving feedback to improve is not fair. – Ricky Jul 15 '17 at 02:16

1 Answer


Does compute-accuracy work on locally-trained vectors? (That is, is your setup working without adding the extra variable of Facebook-sourced vectors.)

If so, then does the locally-trained vector set that works with `compute-accuracy` appear to be the same format/encoding as the Facebook-downloaded file?

If I understand correctly, the `.vec` files are in a text format. The example of using the `compute-accuracy` executable in the word2vec repository passes binary-format vectors as the argument. See:

https://github.com/tmikolov/word2vec/blob/master/demo-word-accuracy.sh#L7

gojomo
  • Thanks for the reply! I haven't gotten `compute-accuracy` to work with any of my locally-trained vectors either (all of which are `.vec` files). Yes, I noticed that the file `compute-accuracy.c` reads the input file in `rb` mode, and so I even changed the code to `r` and tried, it still didn't help. Please see line 41: https://github.com/tmikolov/word2vec/blob/master/compute-accuracy.c#L41 – Ricky Jul 15 '17 at 02:21
  • I also just converted my file from `.vec` to `.txt` using `mv file.vec file.txt` and then converted the `.txt` file to a binary `.bin` file using the following tool called convertvec: https://github.com/marekrei/convertvec I then passed the `.bin` file to the `compute-accuracy` executable -- that didn't help either. Please help, I've been stuck on this for way too long! – Ricky Jul 15 '17 at 02:39
  • I'm unfamiliar with that `convertvec` tool so can't comment on its reliability or appropriateness. I suggest trying to get things to work with locally-trained, saved-to-binary-mode vectors first – like in the `demo-word-accuracy.sh` script – to be sure your local word2vec tools are even working. Only after that's working, try extra steps to convert FB's `.vec` to the original word2vec `.bin`. – gojomo Jul 15 '17 at 04:00
  • In recent days I've been using `convertvec`, and it sometimes fails to convert `.vec` files to binary format. You need to verify the header of the file. Unfortunately `convertvec` does not print logging or possible errors. Also try on another machine. – Nacho Oct 04 '17 at 21:03