I'm working on an NLP application where I have a corpus of text files, and I would like to create word vectors using Gensim's word2vec implementation.
I did a 90% training / 10% testing split of the files and trained the model on the training set, and now I would like to assess the accuracy of the model on the test set.
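For context, my training step looks roughly like the sketch below; the folder name, the quick line-based tokenisation, and the hyperparameters are illustrative rather than my exact code:

import glob
import re

import gensim

train_filenames = glob.glob('./training/*.txt')   # the 90% split lives in ./training/ (placeholder path)

# Rough tokenisation for this sketch: treat each line as a sentence,
# keep letters only, split on whitespace
train_sentences = []
for name in train_filenames:
    with open(name) as f:
        for line in f:
            words = re.sub("[^a-zA-Z]", " ", line).split()
            if words:
                train_sentences.append(words)

# size / min_count / workers are illustrative hyperparameters (gensim 3.x API)
texts2vec = gensim.models.Word2Vec(train_sentences, size=300, min_count=3, workers=4)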
I have searched the internet for documentation on accuracy assessment, but I could not find any method that lets me do this. Does anyone know of a function that does this kind of accuracy analysis?
The way I processed my test data was to extract all the sentences from the text files in the test folder and turn them into one giant list of sentences. After that, I called a function that I thought was the right one, but it turned out not to be, because it gave me this error: TypeError: don't know how to handle uri. Here is how I went about it:
import glob
import re

test_filenames = glob.glob('./testing/*.txt')
print("Found corpus of %s safety/incident reports:" % len(test_filenames))

# Read every test file into one long unicode string
test_corpus_raw = u""
for text_file in test_filenames:
    with open(text_file, 'r') as txt_file:
        test_corpus_raw += txt_file.read().decode('utf-8')
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

# tokenizer is the NLTK punkt sentence tokenizer I loaded when preparing the training data
test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    # Keep letters only, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

# Giant list of tokenised sentences
test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum(len(sentence) for sentence in test_sentences)
print("The test corpus contains {0:,} tokens".format(test_token_count))

# texts2vec is the Word2Vec model trained on the 90% training split
####### THIS LAST LINE PRODUCES THE ERROR: TypeError: don't know how to handle uri
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)
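From looking at the gensim docs, my understanding is that wv.accuracy() expects a path to an analogy-questions file (in the questions-words.txt format that ships with word2vec), not a list of tokenised sentences, which is presumably why it complains about a uri. A call in that style would look something like this (the file path is just an example):

# questions-words.txt contains analogy questions grouped into sections, e.g.
#   : capital-common-countries
#   Athens Greece Baghdad Iraq
# (assuming such a file is available locally)
analogy_results = texts2vec.wv.accuracy('./questions-words.txt', case_insensitive=True)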
I have no idea how to fix this last part so that I can evaluate the model on my own test split. Please help. Thanks in advance!