I am trying to find word similarity between a list of 5 words and a list of 3500 words.
The problem that I am facing:
The list of 5 words I have is as follows:
List_five =['cloud','data','machine learning','virtual server','python']
The list of 3500 words contains multi-word entries like:
List_threek =['cloud computing', 'docker installation', 'virtual server'.....]
The spaCy models, through their 'nlp' object, seem to break the entries in the second list into separate tokens: cloud, computing, docker, installation.
This in turn produces inaccurate matches. For example, when I run the following code:
# nlp is a loaded spaCy pipeline
tokens = " ".join(List_five)
doc = nlp(tokens)
top5 = " ".join(List_threek)
doc2 = nlp(top5)
similar_words = []
for token1 in doc:
    list_to_sort = []
    for token2 in doc2:
        # print(token1, token2)
        list_to_sort.append((token1.text, token2.text, token1.similarity(token2)))
I get results like (cloud, cloud) while I expected (cloud, cloud computing). It looks like the word 'cloud computing' is broken into two separate tokens.
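To illustrate what I think is happening, here is a minimal sketch that does not use spaCy at all. It uses a plain whitespace split as a stand-in for tokenization (an assumption, spaCy's tokenizer is more elaborate), and a shortened sample of my second list. It suggests that joining the phrases into one string already discards the phrase boundaries before any tokenizer sees them.

```python
# Shortened sample of the second list (the real one has about 3500 entries).
phrases = ['cloud computing', 'docker installation', 'virtual server']

# Joining everything into one string, as in my code above.
joined = " ".join(phrases)

# Stand-in for tokenization: a plain whitespace split (an assumption,
# not spaCy's actual tokenizer).
tokens_from_joined = joined.split()

# The phrase boundaries are gone: six single words instead of three phrases.
print(tokens_from_joined)

# The original list still holds each phrase as one unit.
print(phrases)
```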
Are there any workarounds? I would like contextually linked phrases such as 'cloud computing' to be retained as a single unit rather than broken into 'cloud' and 'computing'.
Any help is appreciated.
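One direction I could imagine, as a rough sketch only: compare phrase to phrase instead of token to token, so each entry stays whole. The names `rank_phrases` and `token_overlap` below are hypothetical helpers I made up for illustration, and `token_overlap` is just a placeholder word-overlap score, not a semantic similarity.

```python
from typing import Callable, Dict, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Placeholder score: Jaccard overlap of lowercase words (not semantic)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_phrases(
    queries: List[str],
    candidates: List[str],
    similarity: Callable[[str, str], float] = token_overlap,
    top_n: int = 5,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each query, return the top_n candidates by score, phrases kept whole."""
    ranked = {}
    for query in queries:
        # Score every candidate phrase as a single unit against the query.
        scored = [(candidate, similarity(query, candidate)) for candidate in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        ranked[query] = scored[:top_n]
    return ranked


list_five = ['cloud', 'data', 'machine learning', 'virtual server', 'python']
list_threek = ['cloud computing', 'docker installation', 'virtual server']  # shortened sample

result = rank_phrases(list_five, list_threek, top_n=1)
print(result['cloud'])
```

With spaCy, the `similarity` argument could presumably be something like `lambda a, b: nlp(a).similarity(nlp(b))`, assuming `nlp` is a pipeline that ships word vectors. As far as I understand, `Doc.similarity` compares whole-document vectors (by default an average of the token vectors), so 'cloud computing' would be scored as one phrase. Parsing each of the 3500 entries once up front would likely be faster than calling `nlp` inside the loop.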