I have a set of documents stored in a JSON file, one JSON object per line. I retrieve them with the following code so that they are stored in the variable data:
import json
with open('SDM_2015.json') as f:
    # one JSON object per line (JSON Lines format)
    data = [json.loads(line) for line in f]
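A quick sanity check, assuming each line holds an object with a 'body' key (the field used below), might look like:

# hypothetical check: every entry should be a dict exposing a 'body' field
print(len(data))        # number of documents loaded
print(data[0].keys())   # should include 'body'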
Next, the corpus is built by concatenating the body of each document with the body of the one that follows it:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
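As an aside, if the goal were one corpus entry per document rather than these pairwise concatenations, a simpler construction would be the following sketch (an assumption about intent, not what the loop above does):

# alternative: one entry per document (hypothetical reading of the intent)
corpus = [doc['body'] for doc in data]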
So far these are pretty straightforward manipulations. To build the tf-idf matrix I use the following lines of code, which remove stop words and punctuation, stem each term, and tokenize the data:
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
# stemmer to reduce each word to its common root
stemmer = nltk.stem.porter.PorterStemmer()
# map every punctuation character to None so str.translate() strips it
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function: stems each token
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Second function: builds on the first; it lowercases the text, strips punctuation via the map above, and tokenizes
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, the vectorizer ties the previous functions together and adds stop-word removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
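One check worth doing here: nltk.word_tokenize relies on NLTK's 'punkt' tokenizer models, so a minimal smoke test of the pipeline (assuming the models may still need downloading) would be:

# download the tokenizer models once if they are missing
nltk.download('punkt')
# smoke test on a sample sentence; output should be roughly ['the', 'cat', 'are', 'run', 'quickli']
print(normalize("The cats are running quickly!"))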
I then try to apply the vectorizer to the corpus like so:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
But nothing happens. Any idea how to proceed?
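For reference, since TfidfVectorizer L2-normalizes its rows by default, tfidf*tfidf.T already holds cosine similarities, so I would expect the same value from scikit-learn's cosine_similarity helper; a sketch:

from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity between the first two corpus entries;
# should equal ((tfidf*tfidf.T).A)[0,1] because rows are unit-normalized
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])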
Kind regards