Cosine similarity preprocessing task

Question

I have recently started with NLP. As part of a cosine similarity calculation, I have to complete the following task:

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(sent_1)
sent_2 = dictionary.doc2bow(sent_2)
sent_3 = dictionary.doc2bow(sent_3)

I have more than 10000 different sentences (documents), so I want to write code that iterates over the documents automatically. I have tried the following, but it does not work:

sent_X = []
for i in documents:
    sent_X.append(dictionary.doc2bow(simple_preprocess(i)))

Thanks

What's the expected output? What doesn't work? – Riccardo Bucco, Nov 22 '19 at 14:18

score 0 · Answer 1 · answered Nov 22 '19 at 19:53

I think your code works fine; the problem is probably that the resulting output isn't what you expect. Let's walk through a simple example to see how this works:

>>> from gensim import corpora
>>> from gensim.utils import simple_preprocess

>>> documents = ["apple apple apple banana",
...              "hello hello this is a document"]
>>> dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
>>>
>>> sent_X = []
>>> for i in documents:
...     sent_X.append(dictionary.doc2bow(simple_preprocess(i)))
>>> sent_X
[[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]

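I think this result (the output of sent_X) caused your confusion. Each inner list holds the (token id, count) pairs for one document. Let's print a more readable version, with each id replaced by its word: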

>>> for doc in sent_X:
...     print([[dictionary[id_], freq] for id_, freq in doc])
[['apple', 3], ['banana', 1]]
[['document', 1], ['hello', 2], ['is', 1], ['this', 1]]
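
Once every document is a list of (token id, count) pairs, the cosine similarity itself can be computed directly from those pairs. Below is a minimal sketch in plain Python, not gensim's own similarity API; the helper name bow_cosine is hypothetical, and the two vectors are copied by hand from the sent_X output above so the snippet runs without gensim installed.

```python
import math


def bow_cosine(vec_a, vec_b):
    """Cosine similarity between two sparse bag-of-words vectors.

    Each vector is a list of (token_id, count) pairs, the same shape
    as the lists stored in sent_X. (Hypothetical helper, not part of gensim.)
    """
    a = dict(vec_a)
    b = dict(vec_b)
    # Dot product over the token ids that both documents share.
    dot = sum(count * b[token_id] for token_id, count in a.items() if token_id in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat its similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors copied from the example output above.
sent_X = [[(0, 3), (1, 1)], [(2, 1), (3, 2), (4, 1), (5, 1)]]

print(bow_cosine(sent_X[0], sent_X[1]))  # the documents share no words, so this is 0.0
print(bow_cosine(sent_X[0], sent_X[0]))  # a document compared with itself, close to 1.0
```

For 10000 documents, comparing every pair this way would be slow; gensim's similarity index classes are probably the better fit there, and this sketch is only meant to show what the (id, count) pairs are used for.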
