
I am looking to compute similarities between users and text documents using their topic representations. That is, each document and user is represented by a vector of topics (e.g. Neuroscience, Technology, etc.) together with how relevant each topic is to that user/document.

My goal is then to compute the similarity between these vectors, so that I can find similar users, find similar articles, and recommend articles.

I have tried using Pearson correlation, but it ends up taking too much memory and too much time once I reach ~40k articles, with vectors around 10k long.

I am using numpy.
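
(The question doesn't include code, but the scale problem is visible from the numbers alone; below is a back-of-the-envelope check in numpy, where only the 40k/10k figures come from the question and everything else is assumed.)

```python
import numpy as np

n_articles, n_topics = 40_000, 10_000               # figures from the question

itemsize = np.dtype(np.float64).itemsize            # 8 bytes per float64

# The input matrix alone (one row of topic weights per article):
input_gb = n_articles * n_topics * itemsize / 1e9   # ~3.2 GB

# A full article-by-article correlation matrix, e.g. from np.corrcoef(X):
output_gb = n_articles ** 2 * itemsize / 1e9        # ~12.8 GB

print(f"input: ~{input_gb:.1f} GB, full correlation matrix: ~{output_gb:.1f} GB")
```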

Can you think of a better way to do this, or is this cost inevitable (on a single machine)?

Thank you

user1491915

3 Answers


I would recommend just using gensim for this instead of rolling your own.
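
For illustration only, here is a rough sketch of how one might build an all-against-all cosine-similarity index over topic vectors with gensim (the variable names, toy sizes, and the choice of MatrixSimilarity are mine, not the answerer's):

```python
import numpy as np
from gensim import matutils, similarities

# Toy topic matrix: one row of topic weights per article
# (the real one would be roughly 40k x 10k).
topic_matrix = np.random.rand(1_000, 200).astype(np.float32)

# Wrap the dense numpy array as a gensim corpus; rows are documents.
corpus = matutils.Dense2Corpus(topic_matrix, documents_columns=False)

# In-memory cosine-similarity index over all articles.
# (similarities.Similarity is the sharded, disk-backed alternative.)
index = similarities.MatrixSimilarity(corpus, num_features=topic_matrix.shape[1])

# Similarities between article 0 and every article in the index.
sims = index[matutils.full2sparse(topic_matrix[0])]
top10 = np.argsort(-sims)[:10]
```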

Robert Kern
  • Just to understand: what gensim would do for me is dimensionality reduction (using LSI or LDA), right? It would still use something like Pearson's correlation to find similarities, right? – user1491915 Oct 04 '12 at 12:29

I don't quite understand why you run out of memory just computing the correlation for O(n^2) pairs of items. To calculate the Pearson correlation, as the Wikipedia article points out:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

That is, to get the corr(X,Y) you need only two vectors at a time. If you process your data one pair at a time, memory should not be a problem at all.
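
A minimal numpy sketch of that pair-at-a-time approach (the function name and the random vectors are just for illustration):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1-D topic vectors."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Only two 10k-element vectors are in memory at any time.
a = np.random.rand(10_000)
b = np.random.rand(10_000)
print(pearson(a, b))            # manual formula
print(np.corrcoef(a, b)[0, 1])  # same value from numpy's built-in
```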

If you are going to load all vectors and do some matrix factorization, that is another story.

As for computation time, I completely understand: you need to compare O(n^2) pairs of items.
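
If you do need the whole article-by-article matrix, one compromise (my suggestion, not something from this answer) is to standardize the rows once and then fill in the correlation matrix block by block, so each chunk stays small while the heavy lifting goes through a single matrix multiplication:

```python
import numpy as np

def blocked_correlations(X, block=1_000):
    """Yield (row_slice, corr_block) pieces of the full correlation matrix."""
    # Standardize each row once: zero mean, unit L2 norm.
    Z = X - X.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    for start in range(0, Z.shape[0], block):
        stop = min(start + block, Z.shape[0])
        # Each piece is only (block x n_articles), never the full matrix.
        yield slice(start, stop), Z[start:stop] @ Z.T

# Small demo (real data would be ~40k articles x 10k topics).
X = np.random.rand(2_000, 500)
for rows, corr in blocked_correlations(X, block=500):
    nearest = np.argsort(-corr, axis=1)[:, 1:6]   # 5 closest articles per row
```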


Gensim is known to run with modest memory requirements (< 1 GB) on a single CPU/desktop computer within a reasonable time frame. Check this experiment they ran on an 8.2 GB dataset using a MacBook Pro (Intel Core i7 2.3 GHz, 16 GB DDR3 RAM). I think that is a larger dataset than yours.


If you have an even larger dataset, you might want to try the distributed version of gensim, or even map/reduce.

Another approach is to try locality sensitive hashing.
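
For completeness, here is a toy version of the random-hyperplane flavour of locality sensitive hashing (my own illustration of the technique named above, using only numpy):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def lsh_buckets(X, n_bits=16):
    """Hash each row of X to an n_bits signature using random hyperplanes.
    Rows with high cosine similarity tend to land in the same bucket."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    signs = (X @ planes) > 0                 # (n_docs, n_bits) booleans
    keys = np.packbits(signs, axis=1)        # compact byte signatures
    buckets = defaultdict(list)
    for i, key in enumerate(map(bytes, keys)):
        buckets[key].append(i)
    return buckets

X = np.random.rand(5_000, 200)
buckets = lsh_buckets(X)

# Candidate neighbours of document 0 = other members of its bucket;
# only these few need an exact Pearson/cosine comparison.
key0 = next(k for k, members in buckets.items() if 0 in members)
print(len(buckets[key0]) - 1, "candidates for document 0")
```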

greeness

My trick is to use a search engine such as Elasticsearch, and it works very well; this way we unified the API of all our recommender systems. The details are listed below:

  • Train the topic model on your corpus. Each topic is an array of words, each word with a probability, and we take the 6 most probable words as the representation of a topic.
  • For each document in your corpus, infer a topic distribution, i.e. an array with one probability per topic.
  • For each document, generate a fake document from its topic distribution and the topic representations; for example, the fake document is about 1024 words long.
  • For each document, generate a query from its topic distribution and the topic representations; for example, the query is about 128 words long. (A sketch of these steps follows this list.)
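
A rough sketch of the steps above (the topic words, sizes, and helper function are all invented for illustration; the answer doesn't say which topic-modelling library was used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: pretend the trained topic model gave us the 6 most probable
# words per topic (hand-written here).
topic_words = {
    0: ["neuron", "brain", "cortex", "synapse", "memory", "signal"],
    1: ["software", "computer", "data", "network", "code", "device"],
}

def pseudo_text(topic_distribution, size):
    """Steps 3-4: expand a topic distribution into `size` words by sampling
    topics in proportion to their probability and emitting their top words."""
    topics = rng.choice(len(topic_distribution), size=size, p=topic_distribution)
    return " ".join(rng.choice(topic_words[t]) for t in topics)

# Step 2: an inferred topic distribution for one document (made up here).
doc_topics = np.array([0.7, 0.3])

fake_document = pseudo_text(doc_topics, size=1024)  # goes into the search index
query_text = pseudo_text(doc_topics, size=128)      # used for searching later
```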

All the preparation is finished as above. When you want a list of similar articles (or similar users, etc.), you just perform a search:

  • Get the query for your document, and then run that query against the fake documents (see the sketch below).

We found this approach very convenient.
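
And a hedged sketch of the indexing and search step, assuming the official elasticsearch-py client (the index name `articles`, the field `fake_text`, and the 8.x-style keyword arguments are assumptions; older clients use `body=` instead):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Stand-ins for the outputs of the previous sketch.
fake_document = "neuron brain cortex software data code ..."
query_text = "neuron brain software"

# Index one fake document per article (ids here are invented).
es.index(index="articles", id="doc-1", document={"fake_text": fake_document})

# "Find similar / recommend": run the 128-word query against the fake documents.
resp = es.search(index="articles",
                 query={"match": {"fake_text": query_text}},
                 size=10)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```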

Mountain
  • How about recommendations though? How do you represent your users? – user1491915 Feb 25 '13 at 13:21
  • I think you can represent a user by the pages he has visited, with each visit counted once. This should be feasible. – Mountain Feb 25 '13 at 13:31
  • About your out-of-memory issue: "the vectors' length is around 10k" means you have ~10,000 topics, which I think is too many. In our case, we have 20k docs and only 512 topics, and we found the results acceptable. – Mountain Feb 25 '13 at 13:35
  • Check out my new project https://github.com/guokr/simbase , it helps you compute similarities between vector sets. – Mountain Jun 12 '14 at 15:27