
"Kernel died, restarting" when using linear_kernel or cosine_similarity on a TfidfVectorizer matrix

I am running scikit-learn's TfidfVectorizer with fit_transform on some text data, as in the example below. When I try to calculate the similarity matrix, I get the error "Kernel died, restarting".

This happens whether I use the cosine_similarity or the linear_kernel function:

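from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

tf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tf.fit_transform(products['ProductDescription'])

# Either of these calls kills the kernel:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)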

Maybe the problem is the size of my data?

My TF-IDF matrix has shape (178350, 143529), which should generate a (178350, 178350) cosine_sim matrix.

plamut
ana

1 Answer


As far as I understand, you want to calculate an N x N similarity table.

In that case the CSR matrix is quite large, so it is hard to compute the whole table at once. My approach was to call cosine_similarity(tfidf_matrix[index], tfidf_matrix[:]) once per row, N times in total.
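To put a number on "quite large": a dense (178350, 178350) result holds about 3.18 × 10^10 entries, which at 8 bytes per float is roughly 254 GB, so the kernel most likely dies from running out of memory. Below is a minimal single-machine sketch of the one-row-at-a-time idea, generalised to blocks of rows and keeping only the k best matches per row. The function name top_k_similar, the parameters k and batch_size, and the toy docs list are all assumptions made for illustration; none of them come from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_similar(tfidf_matrix, k=5, batch_size=1000):
    """Hypothetical helper: for each row, find its k most similar other rows.

    Only a dense (batch_size, n_rows) block is held in memory at a time,
    instead of the full (n_rows, n_rows) matrix.
    """
    n_rows = tfidf_matrix.shape[0]
    k = min(k, n_rows - 1)
    top_idx = np.empty((n_rows, k), dtype=np.int64)
    top_sim = np.empty((n_rows, k), dtype=np.float64)

    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # TfidfVectorizer L2-normalises rows by default, so the plain dot
        # product (linear_kernel) equals the cosine similarity here.
        block = linear_kernel(tfidf_matrix[start:stop], tfidf_matrix)
        # Mask each row's similarity with itself so it is never selected.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Select the k largest entries per row, then sort them descending.
        part = np.argpartition(block, -k, axis=1)[:, -k:]
        part_sim = np.take_along_axis(block, part, axis=1)
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)
    return top_idx, top_sim


# Toy data standing in for products['ProductDescription'].
docs = [
    "red cotton shirt",
    "blue cotton shirt",
    "stainless steel kettle",
    "electric steel kettle",
]
tf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tf.fit_transform(docs)
top_idx, top_sim = top_k_similar(tfidf_matrix, k=1, batch_size=2)
```

With batch_size=1000 and 178350 rows, each dense block would take about 1.4 GB, so batch_size is the knob to turn if memory is still tight. Keeping only the top k matches is a design choice that suits recommendation-style lookups; if every pairwise score is really needed, the blocks would have to be written to disk rather than kept in memory.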

I actually ran this with PySpark:

def calculate_one_to_all_similarity(index):
    ...
    # Similarity of one row against every row of the matrix.
    return cosine_similarity(tfidf_matrix[index], tfidf_matrix)

rdd.map(lambda r: calculate_one_to_all_similarity(r2index[r]))
shaik moeed