Arranging documents in a grid in accordance with the content similarity

Question

How can I arrange documents in a space (say, multiple grids) so that the position of each document carries information about how similar it is to other documents? I looked into K-means clustering, but it is computationally intensive when the data set is large. I'm looking for something like hashing the contents of a document, so that the documents fit in a large space, similar documents get similar hashes, and the distance between them is small. It would then be easy to find documents similar to a given document without much extra work.

The result could be something similar to the picture below. In this case music documents are near film documents but far from documents related to computers. The box can be considered as the whole world of documents.

enter image description here

Any help would be greatly appreciated.

Thanks

jvc007

the picture link is broken. – rocksportrocker Apr 19 '13 at 11:41
looks like MDS-based plotting, as I described below. – rocksportrocker Apr 19 '13 at 11:49

score 4 · Accepted Answer · answered Apr 19 '13 at 11:49

One way to introduce a distance or similarity measure between documents is:

First, encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
The scalar product of the vectors belonging to two documents then gives you a measure of their similarity: the larger the value, the more similar the documents.
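
The two steps above can be sketched in plain Python. This is only an illustration under some assumptions that are not in the answer: whitespace tokenization, a basic `tf * log(N / df)` weighting, and length-normalized vectors (so the scalar product equals the cosine similarity). The function names `tfidf_vectors` and `dot` and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse, L2-normalized TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: (count / len(tokens)) * math.log(n_docs / df[t])
               for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

def dot(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical toy corpus: two documents share the term "music".
docs = [
    "guitar chords and jazz music",
    "film score music and cinema",
    "linux kernel compiler and cpu",
]
vecs = tfidf_vectors(docs)
sim_music_film = dot(vecs[0], vecs[1])
sim_music_computers = dot(vecs[0], vecs[2])
```

Here the word "and" occurs in every document, so its IDF weight is zero and it does not contribute to any similarity; the music and film documents are similar only through the shared term "music".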

Applying MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) to these similarities should help to visualize the documents in a two-dimensional plot.
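
As a rough sketch of what MDS does, here is classical (Torgerson) MDS in plain Python, using power iteration to find the leading eigenvectors. Several things here are assumptions, not part of the answer: the input is a symmetric distance matrix (for normalized document vectors one could try something like `1 - similarity`), the function name `classical_mds` is made up, and non-positive eigenvalues are simply mapped to a zero coordinate. For real data a tested library implementation would be the safer choice.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n items in `dims` dimensions from an n x n symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (column means equal row means by symmetry).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the current dominant eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        # Rayleigh quotient gives the (signed) eigenvalue for the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(lam) if lam > 0 else 0.0
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflation: remove the component that was just found.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Hypothetical example: distances of a 3-4-5 right triangle.
points_dist = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]
layout = classical_mds(points_dist)
```

Because the example distances come from points that really lie in a plane, the two-dimensional layout reproduces them up to rotation and reflection. Distances derived from document similarities generally cannot be embedded exactly, so the plot is only an approximation.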

score 2 · Answer 2 · answered Apr 19 '13 at 12:52

The problem of mapping high-dimensional data to a low-dimensional space while preserving similarity can be solved using a self-organizing map (SOM, or Kohonen network). I have already seen some applications to documents.

I don't know of any Python implementation (there might be one), but there is a good one for MATLAB (SOM Toolbox).
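
To illustrate the idea, a very small SOM can be written in plain Python. This is a sketch, not a replacement for a proper toolbox; the function names `train_som` and `best_matching_unit`, the linear decay schedules, the Gaussian neighborhood, and the toy input vectors are all assumptions made for the example.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid cell (row, col) whose weight vector is closest to x."""
    return min(weights,
               key=lambda cell: sum((a - b) ** 2 for a, b in zip(weights[cell], x)))

def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols map; returns a dict (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighborhood radius shrinks
        for x in rng.sample(data, len(data)):      # visit inputs in random order
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull the cell's weights toward x, more strongly near the winner.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical 4-dimensional document vectors forming two loose groups.
docs = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
som = train_som(docs, rows=3, cols=3)
cells = [best_matching_unit(som, d) for d in docs]
```

After training, `cells` holds the grid position of each document. Looking up similar documents then amounts to inspecting the same or neighboring cells.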

Josef Borkovec

If we are using SOM, the input vector size when dealing with documents would be very large, right? Can this be used for large-scale document classification? Thanks – jvc Apr 20 '13 at 10:24
You can use [LSI](http://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html) to reduce dimensionality to a manageable number. – Josef Borkovec Apr 22 '13 at 18:43

score 0 · Answer 3 · edited May 23 '17 at 12:04

I think what you're looking for is locality-sensitive hashing. See this answer for a nice, graphical explanation and sample code.
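
As a minimal sketch of one LSH family (random hyperplanes, which approximates cosine similarity), the following plain-Python code hashes a document vector to a bit signature. The choice of this particular family, the function names, the 16-bit signature length, and the sample vector are assumptions for illustration; the answer itself does not say which variant is meant.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    bits = 0
    for plane in planes:
        side = sum(p * x for p, x in zip(plane, vec)) >= 0
        bits = (bits << 1) | int(side)
    return bits

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(dim=4, n_bits=16)

doc = [0.9, 0.1, 0.0, 0.3]          # hypothetical document vector
scaled = [2 * x for x in doc]       # same direction, different length
opposite = [-x for x in doc]        # pointing the opposite way

sig_doc = lsh_signature(doc, planes)
sig_scaled = lsh_signature(scaled, planes)
sig_opposite = lsh_signature(opposite, planes)
```

Vectors pointing in the same direction get identical signatures, and the smaller the angle between two vectors, the fewer bits are expected to differ. Documents can then be bucketed by signature (or by a prefix of it) so that candidates for "similar to this document" are found without comparing against the whole collection.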

answered Apr 19 '13 at 13:56

Fred Foo
