I am trying to load GloVe 100d embeddings into a spaCy NLP pipeline.
I create the vocabulary in spacy format as follows:
python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt
glove.6B.100d.txt was first converted to word2vec text format by adding "400000 100" as the first line.
Now spacy.glove.model/vocab contains the following files (sizes in bytes):
5468549 key2row
38430528 lexemes.bin
5485216 strings.json
160000128 vectors
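The vectors file size is at least consistent with 400000 float32 vectors of dimension 100; the 128 leftover bytes are presumably header/metadata (my assumption):

```python
# Sanity check on the on-disk vectors file size:
# 400000 vectors * 100 dims * 4 bytes (float32) should account
# for almost all of the 160000128 bytes listed above.
num_vectors, dim, float32_bytes = 400000, 100, 4
expected = num_vectors * dim * float32_bytes
print(expected)              # 160000000
print(160000128 - expected)  # 128 bytes left over
```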
In the code:
import spacy
nlp = spacy.load("en_core_web_md")
from spacy.vocab import Vocab
vocab = Vocab().from_disk('./spacy.glove.model/vocab')
nlp.vocab = vocab
print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape)
which gives 407174 and (400000, 100) respectively.
However, the problem is that:
V=nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank)
gives 0
I just want to use the 100d GloVe embeddings within spaCy in combination with the "tagger", "parser", and "ner" components from en_core_web_md.
Does anyone know how to do this correctly (and whether it is possible at all)?