
My use case is to vectorize the words in two lists like the ones below.

ListA = ['Japan', 'Electronics', 'Manufacturing', 'Science']

ListB = ['China', 'Electronics', 'AI', 'Software', 'Science']

I understand that word2vec and GloVe can vectorize words, but they do so from a corpus or bag of words, i.e. we have to pass in sentences, which are broken down into tokens and then vectorized.

Is there a way to vectorize just the words in a list?

PS. I am new to the NLP side of things, so please pardon any obvious points.

khelwood
Ridhima Kumar

3 Answers

0

What you might be looking for is simply pre-trained embeddings. Is that the case? If so, you can use this:

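import spacy

# Medium English model with pre-trained word vectors
# (install once with: python -m spacy download en_core_web_md)
nlp = spacy.load('en_core_web_md')

# ListA and ListB are the lists of strings from the question
tokens = nlp(' '.join(ListA + ListB))

# Print the similarity score for every ordered pair of tokens
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))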
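As background for readers new to this: spaCy documents the default `similarity` estimate as a cosine similarity over word vectors, and each token's vector is available as `token.vector`. The following is only a rough pure-Python sketch of that computation. The 3-dimensional vectors are made up for illustration and are not real embeddings, and `cosine_similarity` is a hypothetical helper, not part of spaCy.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "Japan": [1.0, 0.0, 1.0],
    "China": [1.0, 0.0, 0.0],
    "Software": [0.0, 1.0, 0.0],
}

# Compare every toy word against "Japan"
for word, vec in toy_vectors.items():
    print(word, round(cosine_similarity(toy_vectors["Japan"], vec), 3))
```

Vectors pointing in similar directions score close to 1, and unrelated (orthogonal) ones score close to 0.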
Ethan Koch
  • Hi @EthanKoch. Thanks for your answer. Will token1 and token2 in the for loop iterate over ListA and ListB and store their values? I am getting the error "'str' object has no attribute 'text'" when I run this for loop. – Ridhima Kumar Oct 29 '18 at 18:06
  • @RidhimaKumar very sorry -- I forgot to call nlp on the string – Ethan Koch Oct 29 '18 at 19:37
  • Hi @Ethan, no problem, I figured that out. It is working nicely now. Just a final question: I got the word pairs with their respective cosine values. How do I sort them in descending order of cosine value? For example, I have Japan-China (0.3), Japan-Electronics (0.5), Japan-AI (0.2). I want them arranged in descending order, printing the second half of each pair, i.e. China, Electronics, AI respectively. Once again, big thanks for helping out an NLP newbie like me. – Ridhima Kumar Oct 29 '18 at 19:58
  • I've added another answer in this question. Be sure to upvote! Happy to help. – Ethan Koch Oct 29 '18 at 21:21
0

Here is how to sort the pairs in descending order of cosine value, answering the question from the comments on my other answer:

import spacy

nlp = spacy.load('en_core_web_md')
tokens = nlp(' '.join(ListA+ListB))
list_to_sort = []

# collect (word1, word2, similarity) for every ordered pair of tokens
for token1 in tokens:
    for token2 in tokens:
        list_to_sort.append((token1.text, token2.text, token1.similarity(token2)))

# sort by the similarity score (third element), highest first
sorted_list = sorted(list_to_sort, key=lambda x: x[2], reverse=True)
print(sorted_list)
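One thing to be aware of: a nested loop over the same tokens also pairs each word with itself and lists every pair in both orders. A possible variation, sketched here with only the standard library, keeps each unordered pair once. `ranked_pairs` and `toy_similarity` are hypothetical names, and the letter-overlap scorer is just a stand-in so the sketch runs without a model. With spaCy one could, as an assumption, pass something like `lambda a, b: nlp(a).similarity(nlp(b))` instead.

```python
from itertools import combinations

def ranked_pairs(words, similarity):
    """Return (word1, word2, score) for each unordered pair of distinct words, best first."""
    unique_words = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    scored = [(a, b, similarity(a, b)) for a, b in combinations(unique_words, 2)]
    return sorted(scored, key=lambda item: item[2], reverse=True)

def toy_similarity(a, b):
    """Stand-in scorer (shared-letter overlap); a placeholder, not a real embedding similarity."""
    letters_a, letters_b = set(a.lower()), set(b.lower())
    return len(letters_a & letters_b) / len(letters_a | letters_b)

words = ["Japan", "Electronics", "Science", "China", "Electronics", "Science"]
for a, b, score in ranked_pairs(words, toy_similarity):
    print(a, b, round(score, 2))
```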
Ethan Koch
  • Hi @Ethan. The above code sorts the word pairs by descending cosine value all right. How do I get the top 3 most similar words for each word? An example of the output would be 'China-Japan 0.7, China-Electronics 0.6, China-Science 0.6'. Basically, for each word in ListB, the corresponding top 3 mapping is required. – Ridhima Kumar Oct 30 '18 at 08:02
0

I am assuming you want the top 3 most similar words in ListA for each word in ListB. If so, here is a solution (and if you instead want each word in ListB compared against all words in both lists, I added an optional line for that too):

import spacy

nlp = spacy.load('en_core_web_md')
tokensA = nlp(' '.join(ListA))
# use if wanting tokens in ListB compared to all tokens present: tokensA = nlp(' '.join(ListA+ListB))
tokensB = nlp(' '.join(ListB))

output_mapping = {tokenB.text: [] for tokenB in tokensB}
for tokenB in tokensB:
    for tokenA in tokensA:
        # add the (word, similarity) tuple to the list for this ListB token
        output_mapping[tokenB.text].append((tokenA.text, tokenB.similarity(tokenA)))
    # sort once per ListB token, highest similarity first
    output_mapping[tokenB.text].sort(key=lambda x: x[1], reverse=True)

for key in sorted(output_mapping.keys()):
    # print token from ListB and the top 3 similarities to ListA, sorted
    print(key, output_mapping[key][:3])
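If only the top few matches are needed, sorting the full list for every word is not strictly necessary. A minimal alternative sketch using the standard library's `heapq.nlargest` follows. `top_k_similar` is a hypothetical helper, the similarity function is passed in by the caller, and skipping candidates identical to the query word is a design choice made here, not something the code above does.

```python
import heapq

def top_k_similar(queries, candidates, similarity, k=3):
    """Map each query word to its k most similar candidates, best first."""
    result = {}
    for query in queries:
        # score every candidate except the query word itself (a choice made in this sketch)
        scored = [(cand, similarity(query, cand)) for cand in candidates if cand != query]
        result[query] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return result

def toy_similarity(a, b):
    """Stand-in scorer (shared-letter overlap); a placeholder, not a real embedding similarity."""
    letters_a, letters_b = set(a.lower()), set(b.lower())
    return len(letters_a & letters_b) / len(letters_a | letters_b)

ListA = ["Japan", "Electronics", "Manufacturing", "Science"]
ListB = ["China", "Electronics", "AI", "Software", "Science"]
for word, matches in top_k_similar(ListB, ListA, toy_similarity).items():
    print(word, matches)
```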
Ethan Koch
  • Hi, is there a way to add exceptions for tokens in spaCy's models? What I mean is that I have two lists `ListA = [Japan, Electronics, Manufacturing, Science, cloud] ListB = [China, Machine Learning, Artificial Intelligence, Software development, Science, cloud computing]`. Similarity gives me (cloud, cloud) instead of (cloud, cloud computing). Is there a way to make the 'nlp' object keep tokens that contain spaces? Currently it seems to split a token in two wherever there is a space. – Ridhima Kumar Nov 02 '18 at 09:37
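On the multi-word question: joining every list item into one string is what discards the phrase boundaries. One possible workaround (an assumption, not something confirmed in this thread) is to process each list item separately, e.g. calling `nlp(item)` per item and comparing the resulting `Doc` objects, whose vector spaCy documents as defaulting to the average of the token vectors. The pure-Python sketch below illustrates only that averaging idea. `phrase_vector` is a hypothetical helper and the 2-dimensional vectors are made up.

```python
def phrase_vector(phrase, word_vectors):
    """Average the vectors of the whitespace-separated words in a phrase."""
    vectors = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vectors:
        raise KeyError(f"no known words in {phrase!r}")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional word vectors, made up for illustration only.
toy_word_vectors = {
    "cloud": [1.0, 0.0],
    "computing": [0.0, 1.0],
    "science": [0.5, 0.5],
}

# Each list item is handled on its own, so "cloud computing" stays one unit.
ListB = ["Science", "cloud computing"]
phrase_vectors = {item: phrase_vector(item, toy_word_vectors) for item in ListB}
print(phrase_vectors)
```

Each phrase then has a single vector, so it can be compared as a whole against the items of the other list.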