How can I find and print unmatched/dissimilar words from the documents (dataset)?

Question

I am trying to rewrite an algorithm that takes an input text file, compares it with several documents, and reports how similar they are.

Now I want to print the unmatched words and also write them to a new text file.

In the code below, "hello force" is the input. It is checked against raw_documents, and the code prints a rank between 0 and 1 for each document. The word "force" matches the second document, so that document gets a higher rank, but "hello" does not appear in any of the raw_documents. What I want is to print every input word, such as "hello", that was not matched by any of the raw_documents.

import gensim
import nltk

from nltk.tokenize import word_tokenize

raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]

# Tokenize and lowercase every document
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]

# Map each distinct token to an integer id
dictionary = gensim.corpora.Dictionary(gen_docs)

# Bag-of-words vector for each document
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]

tf_idf = gensim.models.TfidfModel(corpus)

# Total number of (token id, count) entries in the corpus (not used below)
s = 0
for i in corpus:
    s += len(i)

# Similarity index over the TF-IDF vectors of the documents
sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))
query_doc = [w.lower() for w in word_tokenize("hello force")]

query_doc_bow = dictionary.doc2bow(query_doc)

query_doc_tf_idf = tf_idf[query_doc_bow]
result = sims[query_doc_tf_idf]
print(result)

What specifically do you want your code to do that it isn't already doing? — Jordan Singer, Feb 05 '19 at 15:11
For now it prints the matching rank between 0 and 1, but I want to print the words from the input that do not match any of the documents (dataset). — Adil Shaik, Feb 05 '19 at 15:23
What do you mean by "not matching"? Can you give some examples of inputs and desired outputs? — gojomo, Feb 05 '19 at 15:57
So if "hello world" is the input, the program has to check it against two or more documents and return a ranking of the matches (this is what the above code does). But if only one word of the input matches a document (say "world"), the code should print the unmatched word "hello". — Adil Shaik, Feb 06 '19 at 08:21
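
The behaviour described in the comments could be sketched roughly as follows. This is only a minimal illustration, written without gensim or nltk so that it runs on its own: it collects every word that occurs in the documents into a set and reports the input words that are not in it. The helper names tokenize, unmatched_words and write_words are made up for this sketch, and the regex tokenizer is a simplified stand-in for word_tokenize, so punctuation and contractions may be split differently than in the question's code.

```python
import re

raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]


def tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase the text and
    # keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def unmatched_words(query, documents):
    # Vocabulary = every distinct word that occurs in any document.
    vocabulary = set()
    for doc in documents:
        vocabulary.update(tokenize(doc))

    # Keep the input words that are missing from the vocabulary,
    # in their original order and without repeats.
    missing = []
    for word in tokenize(query):
        if word not in vocabulary and word not in missing:
            missing.append(word)
    return missing


def write_words(words, path):
    # Write one word per line to a new text file (the path is up to the caller).
    with open(path, "w") as out:
        for word in words:
            out.write(word + "\n")


unmatched = unmatched_words("hello force", raw_documents)
print(unmatched)  # ['hello']
```

Calling write_words(unmatched, "unmatched_words.txt") would then put the unmatched words into a new text file; the file name is just an example. With the gensim objects from the question, the same membership test could presumably be done against dictionary.token2id, and Dictionary.doc2bow also accepts return_missing=True, which should hand back the tokens it did not find in the dictionary alongside the bag-of-words vector.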
