How to generate a new column based on some other column after clustering the data?

Question

I have a dataframe like this with columns - ["A","B","C","D"]

A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10","BBYY-20" etc
C --> A date-time field
D --> Text-based column, describing if a person was interested in the movie or not based on short text(basically their comments after coming out of theatre)

Sample df

A  | B | C | D
------------------------------------------------------------------------------
Yes|AAXX-10|8/10/2018|"Yes I liked the movie, it was great"
------------------------------------------------------------------------------
Yes|BBYY-20|8/10/2017|"I liked the performance of the cast in the movie but as a whole, It was just average"
------------------------------------------------------------------------------
No |AANN-88|8/10/2013|"Never seen a ridiculous movie like this"

I have two questions here -

I want to make a fifth column, say "Interest", based on the column "D" which would have 4 categories ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that?

--On the basis of "D", the "Interest" column should have ["Liked", "Average", "Didn't like"]--.

Most of the columns are categorical or date-time, and one column is text. How should I do the feature engineering in this scenario so that I can feed the data to KMeans?

How do I get features out of column "D", which is a text feature?

Should I convert column A to binary 0s and 1s?

Should I apply one-hot encoding or label encoding to the second column?

How do I make use of the date-time feature in the clustering?

Things I have tried -

I preprocessed and feature-engineered column A (converted to binary), B (label encoding), C (converted the dates to year and month features) and D (ignored, as I did not know how to use it).

Based on this, I got clusters using kmeans.labels_, but those clusters are numeric: 1, 2, 3, 4.

How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]? How can I use the text column efficiently to make the clusters?

Just short answers to my query would do. I don't need any implementation.

score 1 · Accepted Answer · answered Mar 06 '21 at 17:37

To answer the second question first:

A: can be turned into a binary 0/1 column.

B: what information can you get from a list of unique strings by encoding? After encoding you are left with either the identity matrix (one-hot) or a list of monotonically increasing ints (label encoding).

C: it is probably better to convert this to a Unix epoch timestamp if the date range allows it; this lets you calculate distances properly.
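A minimal sketch of the A and C conversions with pandas, using the sample rows from the question. The new column names (A_bin, C_epoch) are made up for illustration, and the date format is an assumption: the question does not say whether 8/10/2018 means month/day or day/month.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["Yes", "Yes", "No "],
    "B": ["AAXX-10", "BBYY-20", "AANN-88"],
    "C": ["8/10/2018", "8/10/2017", "8/10/2013"],
})

# A: map Yes/No to 1/0 (strip stray whitespace first).
df["A_bin"] = (df["A"].str.strip() == "Yes").astype(int)

# C: parse the dates, then count whole seconds since 1970-01-01.
# ASSUMPTION: the strings are month/day/year.
dates = pd.to_datetime(df["C"], format="%m/%d/%Y")
df["C_epoch"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(df[["A_bin", "C_epoch"]])
```

Note that KMeans uses Euclidean distance, so a raw epoch column (values around 10^9) would dominate a 0/1 column; scaling the features before clustering is probably needed.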

D: This is the bread and butter of the project. The processing is quite involved, but here is a short summary.

A basic recipe includes, but is not limited to:

Text normalization:
- converting to lower or upper case,
- converting numbers into words or removing numbers,
- removing punctuation, accent marks and other diacritics,
- removing leading or trailing white space
Corpus tokenization (split each row into a list of single words):
- remove stop words (a, the, ...); they are common and carry very little information
Stemming or lemmatization. These reduce words to a base form. Stemming is quite crude and can produce invalid words, but it is fast. Lemmatization produces valid words based on a dictionary, but it is slower.
... (many more steps) ...
Feature extraction with TF-IDF. This is a sort of encoding that gives each word an importance score. It works by increasing the weight of a word when it appears many times in a document, and lowering its weight when it is common across many documents.
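The normalization, tokenization and stop-word steps above can be sketched with the standard library alone. This is only an illustration: the STOP_WORDS set is a tiny hand-made list (a real project would use a fuller one), the function names are made up, and stemming/lemmatization is left out.

```python
import re
import string
import unicodedata

# ASSUMPTION: a tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "i", "it", "was", "is", "of", "in", "and", "but", "as", "this"}

def normalize(text):
    """Lower-case, strip accents, drop digits and punctuation, trim spaces."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()

def tokenize(text):
    """Split normalized text on whitespace and drop stop words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]

print(tokenize("Yes I liked the movie, it was great"))
```

A stemmer (for example NLTK's PorterStemmer) could be applied to each token afterwards if NLTK is available.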

Example for TF-IDF:

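from sklearn.feature_extraction.text import TfidfVectorizer

# Each string is one "document" (one row of column D in the question).
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
# Learn the vocabulary and return a sparse (documents x terms) matrix.
X = vectorizer.fit_transform(corpus)
# Vocabulary terms, in the same order as the columns of X.
print(vectorizer.get_feature_names_out())

print(X.shape)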

After these n steps, you get the answer to your first question. The output could look something like this:

You can find code on how to do all of this here (with NLTK). You might not be allowed to use NLTK, however, in which case you will have a hard time doing all these steps.

This is just what I wanted, I can't thank you enough. Thank you so much, sir. I still have a small doubt about one part of the question: once I have the cluster labels, how can I assign names to them? Say I get the clusters for the first 5 rows as [1,2,1,3,4]. How do I decide whether 1 is "Liked" or 3 is "Liked" (and the same for "Disliked", "Average" and "Cannot comment")? Do I have to look at the most important words in each cluster (as described in the linked article) and then assign the names manually (which cluster describes what)? — Kartik Mehra, Mar 06 '21 at 17:53
You can do that by getting the most common or most important word per cluster and using that. — Igna, Mar 06 '21 at 18:31
Hi. Thanks for your response. Can you please elaborate? Even if we get the most important word per cluster, it will not necessarily be one of ["Liked", "Didn't like", "Average", "Cannot comment"], which is what we want in the additional column. So my understanding is that we need to look at the most important words per cluster, using something like word clouds, and based on that we ourselves assign the names: Cluster 1 -> Liked, Cluster 3 -> Didn't like, Cluster 4 -> Average and so on (with the help of something like a dictionary). Is my understanding correct? — Kartik Mehra, Mar 06 '21 at 18:36
