I modified the example script so that it loads the existing model for my language, reads the word2vec file, and at the end writes everything to a folder (this folder needs to exist already).
Here is vectors_fast_text.py:
[LANGUAGE] = your language code, e.g. "pt"
[FILE_WORD2VEC] = "./data/word2vec.txt"
from __future__ import unicode_literals
import plac
import numpy
import spacy


def main():
    # load the existing model for your language, e.g. spacy.load('pt')
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        # replace the model's vector table with an empty one of the right width
        nlp.vocab.reset_vectors(width=int(nr_dim))
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vector to the vocab
    # save the updated model (the target folder must already exist)
    nlp.to_disk("./models/new_nlp/")


if __name__ == '__main__':
    plac.call(main)
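As a quick sanity check (not part of the original script), you can reload the saved model and look up a word that you know appears in your word2vec file; "casa" here is just an example word:
    import spacy

    nlp = spacy.load("./models/new_nlp/")   # the folder the script saved to
    lex = nlp.vocab["casa"]                 # any word present in your word2vec file
    print(lex.has_vector)                   # should print True
    print(lex.vector[:5])                   # first few dimensions of its vector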
Type in the terminal:
python vectors_fast_text.py
It will take a while to finish, roughly 10 minutes depending on the size of the word2vec file. The script prints each word as it is added, so you can follow the progress.
After that, run in the terminal:
python -m spacy package ./models/new_nlp/ ./my_models/
cd ./my_models/pt_example_model-1.0.0   # the folder name depends on your meta.json
python setup.py sdist
This creates a ".tar.gz" archive in the dist/ folder, which you can then install with pip:
pip install /path/to/pt_example_model-1.0.0.tar.gz
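Assuming the package name in your meta.json is pt_example_model (as in the example above), you can then load the installed model by name, as a quick sketch:
    import spacy

    nlp = spacy.load("pt_example_model")   # package name defined in meta.json
    doc = nlp("uma frase de exemplo")
    print(doc[0].text, doc[0].has_vector)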
A detailed tutorial can be found on the spaCy website:
https://spacy.io/usage/training