Using K-means to cluster top topics in a dataset

Question

I'm trying to cluster Twitter data using K-means to show the main topics discussed in a dataset. I currently have a CSV file that has been cleaned and tokenised, with stop words removed.

I am now trying to apply K-means through a simple GUI, which I eventually want to use to visualise the results. The code now runs, but it only creates one cluster with the contents "text". How do I create multiple clusters?

My code:

def k_means_clustering(self):
    df = pd.read_csv("test_data.csv")

    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(df)

    true_k = 1
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

I used this question as a reference when trying to apply K-means: "Clustering text documents using scikit-learn kmeans in Python".

score 0 · Accepted Answer · answered Mar 01 '21 at 01:45

Changing the value of true_k changes the number of clusters generated by the KMeans function.

nipun

If I change the true_k value it produces the following error: ValueError: n_samples=1 should be >= n_clusters=5. – Wynter Rose Mar 01 '21 at 14:02
Assuming you are using the `sklearn` library, this error appears because there is not enough data to train on: `n_samples` (here 1) must be at least `n_clusters`. For more [info](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) – nipun Mar 02 '21 at 01:50
It currently reads in around 15000 tweets. I'm not sure how to increase the sample size as suggested, @nipun – Wynter Rose Mar 02 '21 at 17:25
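
A likely cause of `n_samples=1`, assuming the CSV has a single column headed `text` (which would match the lone "text" cluster described in the question): iterating over a pandas DataFrame yields its column labels rather than its rows, so `fit_transform(df)` only ever sees one document. The sketch below uses a small in-memory frame in place of `test_data.csv`; the column name `text`, the sample tweets and the `random_state` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for pd.read_csv("test_data.csv"); the "text" column name is assumed.
df = pd.DataFrame({"text": [
    "football match tonight",
    "great football goal",
    "election vote results",
    "vote in the election",
]})

# Iterating a DataFrame yields its column labels, not its rows,
# so this is all a vectorizer would see if handed the whole frame.
docs_seen_from_frame = list(df)

# Passing the column itself gives one document per row.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["text"])

true_k = 2
model = KMeans(n_clusters=true_k, init="k-means++", max_iter=100,
               n_init=10, random_state=0)
labels = model.fit_predict(X)

print(docs_seen_from_frame)  # the column labels
print(X.shape[0], labels)    # number of documents and their cluster ids
```

If the real column contains missing values, dropping them first (for example with `df["text"].dropna()`) may be needed, since the vectorizer expects every document to be a string.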