Document clustering using Mean Shift

Question

I took a set of documents, calculated tf*idf for each token, and created a vector for each document (each of n dimensions, where n is the number of unique words in the corpus). I can't figure out how to create clusters from these vectors using sklearn.cluster.MeanShift.

After calculating tf-idf, do you have a matrix (i.e., a table of data with rows and columns) of numeric values? Is it sparse or dense? What type is it in general? Did you use TfidfVectorizer() from sklearn? — Jarad, Sep 12 '17 at 20:28
Yes, I used TfidfVectorizer() and ended up with a sparse matrix. I don't understand how to give that as input to sklearn.cluster.MeanShift — Mourya Vamsi, Sep 13 '17 at 01:22

score 1 · Answer 1 · answered Sep 13 '17 at 04:22

TfidfVectorizer converts documents to a sparse matrix of numbers, but MeanShift requires its input to be dense. Below, I show how to convert it in a pipeline (credit) but, memory permitting, you could just convert the sparse matrix to dense yourself with toarray() (which returns a NumPy array) or todense() (which returns a numpy.matrix).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

pipeline = Pipeline(
  steps=[
    ('tfidf', TfidfVectorizer()),
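    # toarray() returns a plain ndarray; recent scikit-learn versions reject the numpy.matrix that todense() returns
    ('trans', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),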
    ('clust', MeanShift())
  ])

pipeline.fit(documents)
pipeline.named_steps['clust'].labels_

result = [(label,doc) for doc,label in zip(documents, pipeline.named_steps['clust'].labels_)]

for label,doc in sorted(result):
  print(label, doc)

Prints:

0 document two is mean
0 this is document one
0 this is document two
1 document one is fun
1 how fun is document one?
2 mean shift... what is that
3 document is really short

You could tune the hyperparameters (for example, MeanShift's bandwidth), but I think this gives you the general idea.
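As a rough sketch of what tuning a hyperparameter could look like: MeanShift takes a bandwidth argument, and sklearn.cluster.estimate_bandwidth can suggest one from the data. The quantile=0.5 below is an arbitrary illustrative value, not a recommendation, and the documents are the same toy list as above.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Dense tf-idf matrix: one row per document, one column per token
X = TfidfVectorizer().fit_transform(documents).toarray()

# Estimate a kernel bandwidth from nearest-neighbour distances.
# quantile=0.5 is an assumed value chosen only for illustration.
bandwidth = estimate_bandwidth(X, quantile=0.5)

clusterer = MeanShift(bandwidth=bandwidth)
labels = clusterer.fit_predict(X)

for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```

A larger bandwidth tends to merge documents into fewer clusters and a smaller one tends to split them, so it is worth trying a few values on your own corpus.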

What if the input is a CSV in which each row is a keyword or a sentence? I tried adding the following: — sai, Jun 14 '18 at 22:10
import csv
with open('4.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)
— sai, Jun 14 '18 at 22:13
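One possible way to handle the CSV case from the comment above: csv.reader yields each row as a list of cells, while TfidfVectorizer expects one string per document, so the rows need to be joined into strings first. This is only a sketch. The contents of 4.csv are unknown, so an in-memory string (a hypothetical stand-in) replaces the file here; with a real file you would pass the opened file object to csv.reader instead.

```python
import csv
import io

from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the contents of '4.csv' (one sentence per row)
csv_text = (
    "this is document one\n"
    "this is document two\n"
    "document one is fun\n"
    "document two is mean\n"
    "document is really short\n"
    "how fun is document one?\n"
    "mean shift... what is that\n"
)

reader = csv.reader(io.StringIO(csv_text))

# Each row is a list of cells; join the cells so every row becomes one string.
# Empty rows are skipped.
documents = [' '.join(row) for row in reader if row]

# Vectorize, densify, and cluster as in the answer
X = TfidfVectorizer().fit_transform(documents).toarray()
labels = MeanShift().fit_predict(X)

for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```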
