Does anyone know how to load a tsv file with embeddings generated from StarSpace into Gensim? Gensim documentation seems to use Word2Vec a lot and I couldn't find a pertinent answer.
Thanks,
Amulya
You can use the tsv file from a trained StarSpace model and convert that into a txt file in the Word2Vec format Gensim is able to import.
The first line of the new txt file should state the number of vectors (the line count of the tsv file, after deleting any empty lines at the end of the file) followed by the vector size (dimensions). The rest of the file is the same as the original tsv file, but with spaces instead of tabs.
The Python code to convert the file would then look something like this:
with open('path/to/starspace-model.tsv', 'r') as inp, open('path/to/word2vec-format.txt', 'w') as outp:
    line_count = '...' # line count of the tsv file (as string)
    dimensions = '...' # vector size (as string)
    outp.write(' '.join([line_count, dimensions]) + '\n')
    for line in inp:
        words = line.strip().split()
        outp.write(' '.join(words) + '\n')
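If you'd rather not fill in the line count and vector size by hand, both can be derived from the tsv file itself. The following is only a minimal sketch: the function name `tsv_to_word2vec_text` is made up for illustration, and it assumes that every non-empty line is a token followed by tab-separated values, that tokens contain no spaces, and that the whole file fits in memory.

```python
def tsv_to_word2vec_text(tsv_path, out_path):
    """Convert a StarSpace .tsv file to word2vec text format.

    Hypothetical helper; returns (line_count, dimensions) as written
    to the header line of the output file.
    """
    rows = []
    with open(tsv_path, 'r', encoding='utf-8') as inp:
        for line in inp:
            # drop the line ending and any empty fields
            parts = [p for p in line.rstrip('\r\n').split('\t') if p]
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            rows.append(parts)

    if not rows:
        raise ValueError('no embedding rows found in %s' % tsv_path)

    dimensions = len(rows[0]) - 1  # the first field is the token
    for parts in rows:
        if len(parts) - 1 != dimensions:
            raise ValueError('inconsistent vector size for %r' % parts[0])

    with open(out_path, 'w', encoding='utf-8') as outp:
        outp.write('%d %d\n' % (len(rows), dimensions))
        for parts in rows:
            outp.write(' '.join(parts) + '\n')
    return len(rows), dimensions
```

Calling `tsv_to_word2vec_text('path/to/starspace-model.tsv', 'path/to/word2vec-format.txt')` should then produce a file with the header already in place.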
You can then import the new file into Gensim like so:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)
I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
I've not been able to directly load the StarSpace embedding files using Gensim.
However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations.
You can read more about the utility here.
This is the command I used for the conversion:
$ ./embed_doc model train.txt > vectors.txt
This converts the lines from train.txt into vectors and redirects the output into vectors.txt. Sadly, this includes output from the command itself and the input lines again.
Finally, to load the vectors into Python I used the following code (it's probably not very pythonic and clean, sorry).
X = []
with open('vectors.txt') as file:
    for i, line in enumerate(file):
        # skip the first four lines and every odd-indexed line, keeping only the vector lines
        should_continue = i < 4 or i % 2 != 0
        if should_continue:
            continue
        vector = [float(chunk) for chunk in line.split()]
        X.append(vector)
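Skipping lines by position breaks if the number of header lines changes, so one alternative is to keep only the lines that actually parse as numbers. This is just a rough sketch: `load_vectors` and its `dim` parameter are names I made up, and it assumes that no echoed input line consists purely of numbers (passing `dim` helps guard against that case).

```python
def load_vectors(path, dim=None):
    """Collect every line of `path` that parses as a row of floats.

    `dim` is a hypothetical optional filter: when given, only rows with
    exactly that many values are kept.
    """
    vectors = []
    with open(path, 'r', encoding='utf-8') as handle:
        for line in handle:
            chunks = line.split()
            if not chunks:
                continue  # blank line
            try:
                vector = [float(chunk) for chunk in chunks]
            except ValueError:
                continue  # command output or an echoed input line
            if dim is not None and len(vector) != dim:
                continue  # numeric line, but not a full vector
            vectors.append(vector)
    return vectors
```

For example, `X = load_vectors('vectors.txt', dim=100)` would keep only the 100-dimensional rows.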
I have a similar workaround where I used pandas to read the .tsv file and then converted it into a dict where the keys are words and the values are their embeddings as lists.
Here are some functions I used.
from pathlib import Path

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm

in_data_path = Path.cwd().joinpath("models", "starspace_embeddings.tsv")
out_data_path = Path.cwd().joinpath("models", "starspace_embeddings.bin")

# the first column is the word (used as index), the remaining columns are the vector
starspace_embeddings_data = pd.read_csv(in_data_path, header=None, index_col=0, sep='\t')
starspace_embeddings_dict = starspace_embeddings_data.T.to_dict('list')
def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the word vectors in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        Mapping from each word to its embedding (a list of floats).
    vector_size : int
        The number of dimensions of word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

    """
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        # vectors are written in the dict's iteration order
        for word, row in tqdm(vocab.items()):
            if binary:
                row = np.array(row)
                word = str(word)
                row = row.astype(np.float32)
                # tobytes() replaces the deprecated tostring()
                fout.write(to_utf8(word) + b" " + row.tobytes())
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
save_word2vec_format(binary=True, fname=out_data_path, vocab=starspace_embeddings_dict, vector_size=100)
word_vectors = KeyedVectors.load_word2vec_format(out_data_path, binary=True)
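If the binary file fails to load, it can help to read it back without Gensim to see whether the header and vectors line up. Below is a rough sketch using only the standard library: `read_word2vec_binary` is a made-up name, and it assumes the floats were written as little-endian float32 (which is what NumPy produces on most machines).

```python
import struct


def read_word2vec_binary(path):
    """Read a word2vec binary file into a {word: [floats]} dict.

    Hypothetical helper for sanity checks; assumes little-endian float32.
    """
    vectors = {}
    with open(path, 'rb') as fin:
        # header line: "<number of vectors> <vector size>\n"
        count, dim = (int(x) for x in fin.readline().split())
        for _ in range(count):
            word = bytearray()
            while True:
                ch = fin.read(1)
                if not ch:
                    raise EOFError('file ended inside a word')
                if ch == b' ':
                    break  # the space separates the word from its vector
                if ch != b'\n':  # tolerate an optional newline after a vector
                    word.extend(ch)
            raw = fin.read(4 * dim)  # 4 bytes per float32 value
            if len(raw) != 4 * dim:
                raise EOFError('file ended inside a vector')
            vectors[word.decode('utf-8')] = list(struct.unpack('<%df' % dim, raw))
    return vectors
```

If this raises `EOFError`, or the words come out garbled, the `vector_size` passed to the writer most likely does not match the actual length of the rows.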