
I'm working on an NLP application where I have a corpus of text files. I would like to create word vectors using Gensim's word2vec algorithm.

I did a 90%/10% train/test split. I trained the model on the training set, but I would like to assess the model's accuracy on the testing set.

I have searched the internet for documentation on accuracy assessment, but I could not find any method that lets me do this. Does anyone know of a function that does accuracy analysis?

The way I processed my test data: I extracted all the sentences from the text files in the test folder and turned them into one giant list of sentences. After that, I used a function that I thought was the right one (it turns out it wasn't, as it gave me this error: TypeError: don't know how to handle uri). Here is how I went about it:

import glob
import re
import nltk

test_filenames = glob.glob('./testing/*.txt')

print("Found corpus of %s safety/incident reports:" % len(test_filenames))

test_corpus_raw = u""
for text_file in test_filenames:
    # read() returns the whole file as one string; readlines() returns a
    # list, which unicode()/str() would mangle into a repr of that list
    with open(text_file, 'r') as txt_file:
        test_corpus_raw += txt_file.read()
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

# NLTK's Punkt sentence tokenizer (loaded here so the snippet is self-contained)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    # keep letters only, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum(len(sentence) for sentence in test_sentences)
print("The test corpus contains {0:,} tokens".format(test_token_count))


####### THIS LAST LINE PRODUCES AN ERROR: TypeError: don't know how to handle uri 
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)

I have no idea how to fix this last part. Please help. Thanks in advance!

Sam

2 Answers


The accuracy() method of a gensim word-vectors model (now disfavored in comparison to evaluate_word_analogies()) doesn't take your texts as input - it requires a specifically-formatted file of word-analogy challenges. This file is often named questions-words.txt.
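For reference, that file is plain text where each non-header line holds four words A B C D (testing whether vector('B') - vector('A') + vector('C') lands nearest to D), grouped under `: section` header lines. A short excerpt in the format of the standard Google set:

```text
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
```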

This is a popular way to test general-purpose word-vectors, going back to the original Word2Vec paper and code-release from Google.

However, this evaluation doesn't necessarily indicate which word-vectors will be best for your needs. (For example, it's possible for a set of word-vectors to score better on these kinds of analogies, but be worse for a specific classification or info-retrieval goal.)

For good vectors for your own purposes, you should devise a task-specific evaluation that yields a score correlated with success on your final goal.
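As a rough sketch of what such a task-specific check might look like (the words, vectors, and "should be similar" pairs here are all hypothetical stand-ins; in practice the vectors would come from `model.wv[word]`):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-in vectors; in practice these come from the trained model
vectors = {
    "fire":  np.array([0.9, 0.1, 0.0]),
    "smoke": np.array([0.8, 0.2, 0.1]),
    "spill": np.array([0.1, 0.9, 0.2]),
}

# Domain-specific pairs you expect to be similar, scored by mean cosine
expected_pairs = [("fire", "smoke")]
score = sum(cosine(vectors[a], vectors[b]) for a, b in expected_pairs) / len(expected_pairs)
print(round(score, 3))
```

A higher score on pairs your domain experts consider related is a more meaningful signal for your application than a generic analogy score.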

Also, note that as an unsupervised algorithm, word-vectors don't necessarily need a held-out test set to be evaluated. You generally want to use as much data as possible to train the word-vectors – ensuring maximal vocabulary coverage, with the most examples per word. Then you might test the word-vectors to some external standard – like the analogy questions, that weren't part of the training set at all.

Or, you'd just use the word-vectors as an additional input to some downstream task you're testing, and on that downstream task you'd withhold a test set from what's used to train some supervised algorithm. That ensures your supervised method isn't just memorizing/overfitting the labeled inputs, and gives you an indirect quality signal about whether that word-vector set helped the downstream task, or not. (And, that word-vector set could be compared against others based on how well they help that other supervised task – not against their own same unsupervised train-up step.)
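A minimal sketch of that downstream pattern, with toy vectors and a nearest-centroid classifier standing in for a real supervised model (all words, labels, and vectors here are hypothetical):

```python
import numpy as np

# Hypothetical pretrained word vectors (in practice: model.wv[word])
wv = {"fire": np.array([1.0, 0.0]), "smoke": np.array([0.9, 0.1]),
      "spill": np.array([0.0, 1.0]), "leak": np.array([0.1, 0.9])}

def doc_vector(words):
    # Average the word vectors to get a document feature vector
    return np.mean([wv[w] for w in words], axis=0)

# Labeled documents, split into train and a held-out test set
train = [(["fire", "smoke"], "fire"), (["spill", "leak"], "spill")]
test = [(["smoke"], "fire"), (["leak"], "spill")]

# "Train": one centroid per label
centroids = {label: doc_vector(words) for words, label in train}

def predict(words):
    v = doc_vector(words)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

accuracy = sum(predict(w) == y for w, y in test) / len(test)
print(accuracy)
```

The held-out accuracy measures the supervised step, and comparing it across different word-vector sets gives the indirect quality signal described above.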

gojomo

Gensim has various other metrics for testing your vectors, and using them you could probably define your own evaluation functions in a few lines of code. For example, apart from evaluate_word_analogies(), there are functions like evaluate_word_pairs(), closer_than(), distance(), most_similar(), etc. (see the docs for models.keyedvectors for more details). These functions may be used individually or as parts of larger functions for evaluating your word embeddings. Hope this helps!