
I'm working on an NLP application where I have a corpus of text files. I would like to create word vectors using Gensim's word2vec algorithm.

I did a 90%/10% train/test split. I trained the model on the training set, but I would like to assess the model's accuracy on the testing set.

I have searched the internet for documentation on accuracy assessment, but I could not find any method that lets me do this. Does anyone know of a function that does accuracy analysis?

The way I processed my test data: I extracted all the sentences from the text files in the test folder and turned them into one giant list of sentences. After that, I used a function that I thought was the right one (it turns out it wasn't, as it gave me this error: TypeError: don't know how to handle uri). Here is how I went about it:

import glob
import re
import nltk

test_filenames = glob.glob('./testing/*.txt')

print("Found corpus of %s safety/incident reports:" % len(test_filenames))

test_corpus_raw = u""
for text_file in test_filenames:
    # read() returns the whole file as one string; readlines() returns a
    # list, which unicode()/str() would mangle into a repr of that list
    with open(text_file, 'r') as txt_file:
        test_corpus_raw += txt_file.read()
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

# NLTK's Punkt sentence tokenizer (loaded here so the snippet is self-contained)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    # keep letters only, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum(len(sentence) for sentence in test_sentences)
print("The test corpus contains {0:,} tokens".format(test_token_count))


####### THIS LAST LINE PRODUCES AN ERROR: TypeError: don't know how to handle uri 
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)

I have no idea how to fix this last part. Please help. Thanks in advance!

Sam

2 Answers


The accuracy() method of a gensim word-vectors model (now disfavored in comparison to evaluate_word_analogies()) doesn't take your texts as input - it requires a specifically-formatted file of word-analogy challenges. This file is often named questions-words.txt.
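For reference, that file is plain text where each non-header line holds four words A B C D (testing whether vector('B') - vector('A') + vector('C') lands nearest to D), grouped under `: section` header lines. A short excerpt in the format of the standard Google set:

```text
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
```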

This is a popular way to test general-purpose word-vectors, going back to the original Word2Vec paper and code-release from Google.

However, this evaluation doesn't necessarily indicate which word-vectors will be best for your needs. (For example, it's possible for a set of word-vectors to score better on these kinds of analogies, but be worse for a specific classification or info-retrieval goal.)

For good vectors for your own purposes, you should devise a task-specific evaluation that yields a score correlated with success on your final goal.
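As a rough sketch of what such a task-specific check might look like (the words, vectors, and "should be similar" pairs here are all hypothetical stand-ins; in practice the vectors would come from `model.wv[word]`):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-in vectors; in practice these come from the trained model
vectors = {
    "fire":  np.array([0.9, 0.1, 0.0]),
    "smoke": np.array([0.8, 0.2, 0.1]),
    "spill": np.array([0.1, 0.9, 0.2]),
}

# Domain-specific pairs you expect to be similar, scored by mean cosine
expected_pairs = [("fire", "smoke")]
score = sum(cosine(vectors[a], vectors[b]) for a, b in expected_pairs) / len(expected_pairs)
print(round(score, 3))
```

A higher score on pairs your domain experts consider related is a more meaningful signal for your application than a generic analogy score.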

Also, note that as an unsupervised algorithm, word-vectors don't necessarily need a held-out test set to be evaluated. You generally want to use as much data as possible to train the word-vectors – ensuring maximal vocabulary coverage, with the most examples per word. Then you might test the word-vectors to some external standard – like the analogy questions, that weren't part of the training set at all.

Or, you'd just use the word-vectors as an additional input to some downstream task you're testing, and on that downstream task you'd withhold a test set from what's used to train some supervised algorithm. That ensures your supervised method isn't just memorizing/overfitting the labeled inputs, and gives you an indirect quality signal about whether that word-vector set helped the downstream task, or not. (And, that word-vector set could be compared against others based on how well they help that other supervised task – not against their own same unsupervised train-up step.)
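A minimal sketch of that downstream pattern, with toy vectors and a nearest-centroid classifier standing in for a real supervised model (all words, labels, and vectors here are hypothetical):

```python
import numpy as np

# Hypothetical pretrained word vectors (in practice: model.wv[word])
wv = {"fire": np.array([1.0, 0.0]), "smoke": np.array([0.9, 0.1]),
      "spill": np.array([0.0, 1.0]), "leak": np.array([0.1, 0.9])}

def doc_vector(words):
    # Average the word vectors to get a document feature vector
    return np.mean([wv[w] for w in words], axis=0)

# Labeled documents, split into train and a held-out test set
train = [(["fire", "smoke"], "fire"), (["spill", "leak"], "spill")]
test = [(["smoke"], "fire"), (["leak"], "spill")]

# "Train": one centroid per label
centroids = {label: doc_vector(words) for words, label in train}

def predict(words):
    v = doc_vector(words)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

accuracy = sum(predict(w) == y for w, y in test) / len(test)
print(accuracy)
```

The held-out accuracy measures the supervised step, and comparing it across different word-vector sets gives the indirect quality signal described above.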

gojomo

Gensim has various other metrics for testing your vectors, and using them you could probably define your own evaluation functions in a few lines of code. For example, apart from evaluate_word_analogies(), there are functions like evaluate_word_pairs(), closer_than(), distance(), most_similar(), etc. (see the docs for models.keyedvectors for more details). These functions may be used individually or as parts of larger functions for evaluating your word embeddings. Hope this helps!