Clustering text data based on sentiment?

Question

I am scraping reviews from Amazon with the intent of performing sentiment analysis to classify them as positive, negative, or neutral. The data I get will be unlabeled text.

My approach to this problem is as follows:

1.) Label the data using a clustering algorithm like DBSCAN, HDBSCAN, or KMeans. The number of clusters would obviously be 3.

2.) Train a classification algorithm on the labelled data.

I have never performed clustering on text data, but I am familiar with the basics of clustering. So my questions are:

Is my approach correct?
Are there any articles/blogs/tutorials I can follow for text-based clustering, since I am kinda new to this?

Have you had a look into https://www.nltk.org ? That will do the sentiment analysis for you itself :) — grumpyp, Dec 25 '21 at 11:21
I am familiar with the nltk library. But my issue is how to perform clustering on textual data. I am familiar with clustering on string and numerical data but not text data. — Gee, Dec 25 '21 at 13:05

score 1 · Answer 1 · answered Dec 25 '21 at 15:32

I have never run such an experiment, but as far as I know, the most challenging part of this work is transforming the sentences or documents into fixed-length vectors (mapping them into a semantic space). I highly suggest using a sentiment-analysis pipeline from the Hugging Face transformers library to embed the sentences (this way you might exploit some supervision, since those models have already been trained on labelled sentiment data). There are other options as well:

Using the sentence-transformers library (straightforward and still good).
Using bag-of-words (BoW) (the simplest way, but it is hard to get what you want).
Using TF-IDF (still simple, and may be enough to do the job).
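
To make "fixed-length vector" concrete, here is a minimal bag-of-words sketch using only the Python standard library. The sample reviews, the crude tokenizer, and the helper names (tokenize, bag_of_words) are made up for illustration; a real project would more likely use a library vectorizer.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; deliberately crude.
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(reviews):
    """Return (vocabulary, vectors): one count vector per review, all the same length."""
    tokenized = [tokenize(review) for review in reviews]
    # The vocabulary is every distinct token seen in any review, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample reviews, for illustration only.
reviews = [
    "Great product, works great",
    "Terrible product, broke quickly",
    "It is okay",
]
vocabulary, vectors = bag_of_words(reviews)

for review, vector in zip(reviews, vectors):
    print(vector, "<-", review)
```

Every review ends up with a vector whose length equals the vocabulary size, no matter how long the review itself is. TF-IDF follows the same idea but reweights the counts so that words appearing in most reviews matter less.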

Once you reach this point (every review ==> fixed-length vector), you can use whichever clustering algorithm you like and then inspect the results.
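
As a rough sketch of the whole flow with the TF-IDF option, assuming scikit-learn is installed, something like the following should work. The reviews are invented for illustration, and KMeans with 3 clusters is just the setup from the question, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample reviews, for illustration only.
reviews = [
    "Absolutely love it, great quality and fast delivery",
    "Great value, I love this product",
    "Terrible quality, it broke after one day",
    "Awful experience, the item arrived broken",
    "It is okay, nothing special",
    "Average product, does the job",
]

# Step 1: every review ==> fixed-length TF-IDF vector (a sparse matrix row).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Step 2: cluster the vectors. n_clusters=3 mirrors the question's setup;
# random_state is fixed only so repeated runs give the same assignment.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Step 3: look at which reviews landed together.
for label, review in zip(labels, reviews):
    print(label, review)
```

Nothing guarantees that the three clusters line up with positive, negative, and neutral. With BoW or TF-IDF features they may just as easily group reviews by topic or product, which is why inspecting the results matters before using the cluster IDs as training labels.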
