
Does anyone know how to load a .tsv file with embeddings generated by StarSpace into Gensim? The Gensim documentation seems to focus on Word2Vec, and I couldn't find a pertinent answer.

Thanks,

Amulya


3 Answers


You can take the tsv file from a trained StarSpace model and convert it into a txt file in the Word2Vec format that Gensim is able to import.

The first line of the new txt file should state the line count (excluding any empty lines at the end of the file) and the vector size (dimensions) of the tsv file. The rest of the file looks the same as the original tsv file, but with spaces instead of tabs.
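For illustration, with a hypothetical vocabulary of three words and four dimensions, the converted file would start like this (the words and values are made up):

3 4
hello 0.1 -0.2 0.3 0.4
world 0.5 0.6 -0.7 0.8
foo -0.9 1.0 1.1 -1.2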

The Python code to convert the file would then look something like this:

with open('path/to/starspace-model.tsv', 'r') as inp:
    # read all non-empty lines (StarSpace tsv files may end with a blank line)
    lines = [line.strip() for line in inp if line.strip()]

line_count = len(lines)                    # number of vocabulary entries
dimensions = len(lines[0].split()) - 1     # first column is the word itself

with open('path/to/word2vec-format.txt', 'w') as outp:
    outp.write('%d %d\n' % (line_count, dimensions))
    for line in lines:
        outp.write(' '.join(line.split()) + '\n')

You can then import the new file into Gensim like so:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('path/to/word2vec-format.txt', binary=False)

I used Gensim's word_vectors.similarity function to check if the model loaded correctly, and it seemed to work for me. Hope this helps!
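
For instance, a quick sanity check could look like this (the tokens are placeholders; substitute words that actually occur in your StarSpace vocabulary):

print(word_vectors.similarity('apple', 'orange'))  # cosine similarity between two words
print(word_vectors.most_similar('apple', topn=5))  # five nearest neighbours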

Sascha

I've not been able to directly load the StarSpace embedding files using Gensim.

However, I was able to use the embed_doc utility provided by StarSpace to convert my words/sentences into their vector representations. You can read more about the utility here.

This is the command I used for the conversion:

$ ./embed_doc model train.txt > vectors.txt

This converts the lines from train.txt into vectors and pipes the output into vectors.txt. Sadly, this includes output from the command itself and the input lines again.

Finally, to load the vectors into Python, I used the following code (it's probably not very pythonic or clean, sorry).

X = []

with open('vectors.txt') as file:
    for i, line in enumerate(file):
        # skip the command's own output (the first four lines) and the
        # echoed input line that precedes each vector (every odd line)
        if i < 4 or i % 2 != 0:
            continue

        vector = [float(chunk) for chunk in line.split()]
        X.append(vector)
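
If you need a matrix rather than nested lists, the result converts directly, assuming every line of train.txt produced a vector of the same size:

import numpy as np

X = np.array(X)  # shape: (num_documents, embedding_dim)
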
Marc
    "Sadly, this includes output from the command itself and the input lines again." Yes, I don't like this either. I ended up deleting the `cout << input << endl;` line from the `embedDoc` function in the `src/apps/embed_doc.cpp`; (plus `make embed_doc` again)... – SheepPerplexed Aug 09 '18 at 18:40

I have a similar workaround where I used pandas to read the .tsv file and then convert it into a dict where the keys are words and the values are their embeddings as lists.

Here is the code I used.

from pathlib import Path

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.utils import to_utf8
from smart_open import open as smart_open
from tqdm import tqdm

in_data_path = Path.cwd().joinpath("models", "starspace_embeddings.tsv")
out_data_path = Path.cwd().joinpath("models", "starspace_embeddings.bin")

# read the StarSpace tsv: the first column holds the word, the rest the vector
starspace_embeddings_data = pd.read_csv(in_data_path, header=None, index_col=0, sep='\t')

# dict mapping each word to its embedding as a list of floats
starspace_embeddings_dict = starspace_embeddings_data.T.to_dict('list')

def save_word2vec_format(fname, vocab, vector_size, binary=True):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words, mapping each word to its vector.
    vector_size : int
        The number of dimensions of the word vectors.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

    """
    total_vec = len(vocab)
    with smart_open(fname, 'wb') as fout:
        # header line: vocabulary size and vector dimensionality
        fout.write(to_utf8("%s %s\n" % (total_vec, vector_size)))
        for word, row in tqdm(vocab.items()):
            if binary:
                row = np.array(row, dtype=np.float32)
                fout.write(to_utf8(str(word)) + b" " + row.tobytes())
            else:
                fout.write(to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))

save_word2vec_format(binary=True, fname=out_data_path, vocab=starspace_embeddings_dict, vector_size=100)

word_vectors = KeyedVectors.load_word2vec_format(out_data_path, binary=True)
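
As a quick round-trip check, you can compare a loaded vector against the original dict (picking an arbitrary word from the vocabulary):

token = next(iter(starspace_embeddings_dict))
assert np.allclose(word_vectors[token],
                   np.array(starspace_embeddings_dict[token], dtype=np.float32))
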
Espoir Murhabazi