
I'm training a word2vec model on a corpus and then querying the model.

This works fine, but I'm running an experiment where I need to train the model under different conditions, save the model for each condition, query each saved model, and then write the output of the queries to a csv file for further analysis across all the conditions.

I've studied the gensim documentation and searched around, but can't figure out what to do.

I asked the gensim folks and they said that since the result of "most_similar" is a python object I can save it with pickle or save as txt, csv, whatever format I want.

Sounds great, but I don't have a clue how to start. Here's my code - could you help me "fill in the blanks" even with something simple that I can research further and expand on my own?

#train the model
trained_model = gensim.models.Word2Vec(some hyperparameters)

#save the model in the format that is appropriate for querying by writing it to disk and call it stored_model
trained_model.save(some_filename)

#read in the stored model from disk and call it retrieved_model
retrieved_model = gensim.models.Word2Vec.load(some_filename)

#query the retrieved model
#each of these queries produces a list of 10 (word, cosine similarity) tuples
retrieved_model.wv.most_similar(positive=['smartthings', 'amazon'], negative=['samsung'])
retrieved_model.wv.most_similar(positive=['light', 'nest'], negative=['hue'])
retrieved_model.wv.most_similar(positive=['shopping', 'new_york_times'], negative=['ebay'])
.
.
.
#store the results of all these queries in a csv so they can be analyzed.
?
profhoff
  • Can you provide an example of what your csv should look like? `most_similar` returns a list of tuples, something like `[('friend', 0.50342288), ...]`. Do you want one csv for each query (with two columns; 'word' and 'cosine similarity') or do you want a single csv for all queries? As for saving the model, you can use `trained_model.save(some_filename)` and reload with `retrieved_model = gensim.models.Word2Vec.load(some_filename)`. – WhoIsJack Mar 25 '18 at 01:00
  • @WhoIsJack, a csv for each query with two columns like "word" and "cosine similarity" would work, as long as there's a way to tell which query it came from. Even better would be a single csv for all the queries, where each major column represents a query (with a header naming that query) and has two sub-columns, "word" and "cos sim". So, in the perfect world, a table. – profhoff Mar 25 '18 at 01:40

2 Answers


As noted in my comment, you can save and load a model object like this:

# Save model
filename = 'stored_model.wv' # Can be any arbitrary filename
trained_model.save(filename) 

# Reload model
retrieved_model = gensim.models.Word2Vec.load(filename)
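
If you only ever need to query the saved model (and not train it further), a smaller variant is to save just the word vectors. This is a rough sketch using gensim's KeyedVectors API; the filename is arbitrary:

from gensim.models import KeyedVectors

# Save only the word vectors (smaller on disk, query-only)
trained_model.wv.save('stored_vectors.kv')

# Reload them later; most_similar works directly on the KeyedVectors object
retrieved_vectors = KeyedVectors.load('stored_vectors.kv')
retrieved_vectors.most_similar(positive=['smartthings', 'amazon'], negative=['samsung'])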

For running multiple queries, I recommend defining a list of queries and iterating over it to collect all the results.

# Define queries (this is the only user input required!)
my_queries = [{'positive' : ['smartthings','amazon'],
               'negative' : ['samsung']},
              {'positive' : ['light','nest'],
               'negative' : ['hue']},
               #<and so forth...>
              ]

# Initialize empty result list
query_results = []

# Collect query results
for query in my_queries:
    result = retrieved_model.wv.most_similar(**query)
    query_results.append(result)

Finally, you can use the list of results to write the csv file in the format you want. The header of the file can be constructed to represent the queries.

# Open the file
with open("my_results.csv", "w") as outfile:

    # Construct the header
    header = []
    for query in my_queries:
        head = 'pos:'+'+'.join(query['positive'])+'__neg:'+'+'.join(query['negative']) 
        # First resulting head: 'pos:smartthings+amazon__neg:samsung'
        header.append(head)

    # Write the header
    # Note the additional empty fields (,_,) because each head needs two columns
    outfile.write(",_,".join(header)+",_\n")

    # Write the second row to label the columns
    outfile.write(",".join(["word,cos_sim" for i in range(len(header))])+'\n')

    # Write the data: row i holds the i-th result of every query, side by side
    for i in range(len(query_results[0])):
        row_results = [q[i][0]+','+str(q[i][1]) for q in query_results]
        outfile.write(",".join(row_results)+'\n')

Note that this only works so long as each query retrieves the same number of items (which is the case by default but could be changed using the topn keyword argument for most_similar).
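
If you happen to have pandas available, a roughly equivalent sketch builds the same table from the my_queries and query_results lists above and writes both header rows for you (this is just an alternative, not required by the approach above):

import pandas as pd

# One label per query, built the same way as the header above
labels = ['pos:' + '+'.join(q['positive']) + '__neg:' + '+'.join(q['negative'])
          for q in my_queries]
columns = pd.MultiIndex.from_product([labels, ['word', 'cos_sim']])

# Row i holds the i-th result from every query, flattened into (word, cos_sim) pairs
rows = [[field for word, sim in ranked for field in (word, sim)]
        for ranked in zip(*query_results)]

pd.DataFrame(rows, columns=columns).to_csv("my_results.csv", index=False)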

WhoIsJack

A simple way to write all of the model's word vectors to a csv (one row per word) is:

import numpy as np
import pandas as pd

# model is the trained (or reloaded) Word2Vec model; wv.vocab is the gensim 3.x API
vocab, vectors = model.wv.vocab, model.wv.vectors

# get each word and the index of its embedding vector
name_index = np.array([(v[0], v[1].index) for v in vocab.items()])

# build a dataframe from the embedding vectors, indexed by word, and write it out
df = pd.DataFrame(vectors[name_index[:, 1].astype(int)])
df.index = name_index[:, 0]
df.to_csv("embedding.csv")
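
If you're on gensim 4.0 or later (where wv.vocab was removed), a rough equivalent sketch, assuming model is the same trained Word2Vec model:

import pandas as pd

# index_to_key lists the words in the same order as the rows of wv.vectors
words = model.wv.index_to_key
df = pd.DataFrame(model.wv.vectors, index=words)
df.to_csv("embedding.csv")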
Rogers