5

I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.

If I am not wrong, TF-IDF is particularly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply tf-idf to clustering?

hippietrail
alskndalsnd

3 Answers

4

Not exactly: tf-idf gives you the relevance of a term in a given document.
So you can perfectly well use it for clustering by computing a proximity between two documents, something like

proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j))

summed over every term t that appears in both doc i and doc j.
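
To make the formula concrete, here is a minimal Python sketch. It assumes documents are already tokenized into lists of words and uses one common tf-idf variant (relative term frequency times log(N / document frequency)); the names `tf_idf_vectors` and `proximity` are my own, and other weightings would work just as well.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (count / len(doc)) * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors

def proximity(vec_i, vec_j):
    """Sum of tf-idf products over the terms both documents contain."""
    shared = vec_i.keys() & vec_j.keys()
    return sum(vec_i[t] * vec_j[t] for t in shared)

# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stock prices fell sharply".split(),
]
vectors = tf_idf_vectors(docs)
print(proximity(vectors[0], vectors[1]))  # share "the" and "cat"
print(proximity(vectors[0], vectors[2]))  # share no terms
```

Note that this sum is an unnormalized dot product, so long documents tend to score higher; dividing by the two vector lengths would turn it into cosine similarity.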

Regexident
pierroz
4

For document clustering, a common approach is the k-means algorithm. If you know how many types of documents you have, you know what k is.

To make it work on documents:

a) Choose k initial documents at random as the cluster centers.

b) Assign each document to a cluster by picking the cluster whose center is at the minimum distance from the document.

c) Once all documents are assigned, make k new cluster centers by taking the centroid of each cluster, then repeat b) and c) until the assignments stop changing.
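
The steps above could be sketched roughly like this in plain Python. This is only an illustration: it assumes every document is already a dense list of tf-idf weights of the same length, the names `k_means`, `max_iter` and `seed` are my own, and `squared_euclidean` is just a placeholder distance so the snippet runs on its own (any distance function can be passed in instead).

```python
import random

def squared_euclidean(u, v):
    """Placeholder distance; swap in whatever measure you prefer."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(vectors, k, distance, max_iter=100, seed=0):
    """Cluster dense vectors; returns one cluster index per vector."""
    rng = random.Random(seed)
    # a) pick k documents at random as the initial cluster centers
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # b) assign each document to its nearest center
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centers[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so we are done
        assignment = new_assignment
        # c) replace each center with the centroid of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Four toy "documents" in a two-term vocabulary (made up).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = k_means(vectors, 2, squared_euclidean)
print(labels)
```

Passing the distance in as a parameter keeps the loop independent of how documents are compared.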

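Now, the questions are:

a) How to calculate the distance between two documents: use the cosine similarity between the document's term vector and the cluster center's. The terms here are weighted by TF-IDF (calculated earlier for each document). Since cosine is a similarity, assign each document to the cluster with the highest value, or use 1 - similarity as the distance.

b) What the centroid should be: for a given term, the sum of its TF-IDF weights over the cluster's documents divided by the number of documents in the cluster. Do this for every term that occurs in the cluster; this gives you another n-dimensional document.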
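
As a rough sketch of a) and b), assuming each document is a dense list of TF-IDF weights over the same vocabulary (the function names are mine, and treating an all-zero vector as similar to nothing is my own assumption):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document matches nothing
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Turn the similarity into a distance (smaller means closer)."""
    return 1.0 - cosine_similarity(u, v)

def centroid(members):
    """Per-term mean of the TF-IDF weights of a cluster's documents."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]

# Two toy documents averaged term by term.
print(centroid([[2.0, 0.0], [0.0, 4.0]]))
```

The centroid is just another vector in the same term space, so it can be compared with documents using the same `cosine_distance`.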

Hope that helps!

Kapil D
  • can you help on this http://stackoverflow.com/questions/28642930/how-can-i-compute-mtf-idf –  Feb 21 '15 at 07:08
  • So let's say I have 3 documents like this: {1.1, 0, 3.3, 4}, {0, 2, 0, 3}, {1, 1, 1, 1}. Their centroid is {2.1/3, 3/3, 4.3/3, 8/3}, right? – Furkan Gözükara Sep 06 '15 at 09:57
1

TF-IDF serves a different purpose; unless you intend to reinvent the wheel, you are better off using a tool like Carrot. Googling for "document clustering" will turn up many algorithms if you wish to implement one yourself.

Mikos