
I am working on a recommendation engine, and one problem I am facing right now is that the similarity matrix of the items is huge.

I calculated the similarity matrix for 20,000 items and stored it as a binary file, which turned out to be nearly 1 GB. I think it is too big.

What is the best way to deal with a similarity matrix when you have that many items?

Any advice is appreciated!

arslan

1 Answer


The similarity matrix describes how similar each object is to the other objects. Each row consists of the neighbors of one object (the row id), but you don't need to store all of the neighbors; store, for example, only the 20 nearest ones. Use a sparse matrix such as `lil_matrix`: `from scipy.sparse import lil_matrix`
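Below is a minimal sketch of that idea, assuming the full matrix is already available as a dense NumPy array; the names `dense_sim` and `keep_top_k` and the choice of `k=20` are illustrative, not from the original answer:

```python
import numpy as np
from scipy.sparse import lil_matrix

def keep_top_k(dense_sim, k=20):
    """Keep only the k largest similarities per row, stored sparsely."""
    n = dense_sim.shape[0]
    sparse_sim = lil_matrix((n, n), dtype=np.float32)
    for i in range(n):
        row = dense_sim[i]
        # indices of the k largest entries in this row (order not guaranteed)
        top_idx = np.argpartition(row, -k)[-k:]
        sparse_sim[i, top_idx] = row[top_idx]
    # CSR is more convenient for saving and for fast row slicing later
    return sparse_sim.tocsr()
```

With 20,000 items this keeps roughly 20,000 × 20 = 400,000 values instead of 400 million, and the resulting CSR matrix can be written to disk with `scipy.sparse.save_npz`.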

rustohero
  • I later realized that I don't have to store the similarity matrix at all; I just compute it when recommending. The computation is not as slow as I thought, because in practice only a very small part of the whole matrix needs to be computed. – arslan May 11 '17 at 08:23
  • @rustohero Do you know, in case I have the similarities between products in a `csr_matrix`, e.g. row `(product_id1, product_id2) 0.45`, how to filter only the x most similar products to product_id1 without having to convert the matrix to an array? – SarahData Sep 13 '18 at 15:41
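Regarding the on-the-fly approach described in the first comment (and the follow-up question about picking only the x most similar products): here is a minimal sketch, assuming an `item_vectors` matrix of item features (for example a CSR matrix of ratings or TF-IDF vectors) and cosine similarity from scikit-learn; these names and choices are assumptions, not from the original thread.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(item_id, item_vectors, top_n=10):
    """Compute similarities for a single item at query time and return the
    top_n (item index, similarity) pairs, without building the full matrix."""
    # one row of the similarity matrix: this item against all items
    sims = cosine_similarity(item_vectors[item_id:item_id + 1], item_vectors).ravel()
    sims[item_id] = -1.0                      # exclude the item itself
    top_idx = np.argsort(sims)[::-1][:top_n]  # indices of the largest similarities
    return [(int(i), float(sims[i])) for i in top_idx]
```

Only one row of the 20,000 × 20,000 matrix is computed per request, which is why the cost stays small in practice.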