Add exception in Spacy tokenizer to not break the tokens with whitespaces?

Question

I am trying to find word similarity between a list of 5 words and a list of 3500 words.

The problem that I am facing:

The List of 5 words I have are as below

 List_five =['cloud','data','machine learning','virtual server','python']

In the list of 3500 words, there are words like

 List_threek =['cloud computing', 'docker installation', 'virtual server'.....]

The Spacy models through their 'nlp' object seem to break the tokens in the second list into cloud, computing, docket, installation.

This in turn causes similar words to appear inaccurately, For example when I run the following code

tokens = " ".join(List_five)
doc = nlp(tokens)

top5 = " ".join(List_threek)
doc2 = nlp(top5)

similar_words = []
for token1 in doc:
    list_to_sort = [] 
    for token2 in doc2:
    #print(token1, token2)
        list_to_sort.append((token1.text, token2.text, token1.similarity(token2)))

I get results like (cloud, cloud) while I expected (cloud, cloud computing). It looks like the word 'cloud computing' is broken into two separate tokens.

Are there any workarounds? Any help is appreciated.

I would want an exception where contextually linked words like 'cloud computing' is not broken into two like 'cloud' , 'computing' but retained as 'cloud computing'

score 4 · Accepted Answer · edited Nov 29 '18 at 09:02

Spacy also lets you do document similarity (averages word embeddings for words, but that is better than what you are doing now) - so, one way to approach this is to compare an item in list1 and list2 directly without doing it token by token. For example,

import spacy

nlp = spacy.load('en_core_web_sm')

l1 =['cloud','data','machine learning','virtual server','python']
l2=['cloud computing', 'docker installation', 'virtual server']
for item1 in l1:
   for item2 in l2:
       print((item1, item2), nlp(item1).similarity(nlp(item2)))

This will print me something like:

('cloud', 'cloud computing') 0.6696009166814865
('cloud', 'docker installation') 0.6003896898695236
('cloud', 'virtual server') 0.5484600148958506
('data', 'cloud computing') 0.3544642116905426
('data', 'docker installation') 0.4123695793059489
('data', 'virtual server') 0.4785382246303466
... and so on.

Is this what you want?

Add exception in Spacy tokenizer to not break the tokens with whitespaces?

1 Answers1