I have a dataframe like this with columns - ["A","B","C",D"]
A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10","BBYY-20" etc
C --> A date-time field
D --> Text-based column, describing if a person was interested in the movie or not based on short text(basically their comments after coming out of theatre)
Sample df
A | B | C | D
------------------------------------------------------------------------------
Yes|AAXX-10|8/10/2018|"Yes I liked the movie, it was great"
------------------------------------------------------------------------------
Yes|BBYY-20|8/10/2017|"I liked the performance of the cast in the movie but as a whole, It was just average"
------------------------------------------------------------------------------
No |AANN-88|8/10/2013|"Never seen a ridiculous movie like this"
I have two questions here -
- I want to make a fifth column, say "Interest", based on the column "D" which would have 4 categories
["Liked", "Didn't like", "Average", "Cannot comment"]
. How could I do that?
--On the basis of "D", the "Interest" column should have ["Liked", "Average", "Didn't like"]--.
- Since most of the columns are categorical and date-time, and one column as Text. How should I go ahead and do the feature engineering in this particular scenario to be able to feed to Kmeans?
How to get features out of column "D" which is a text feature?.
Should I convert column A to binary 0s a 1s?
Should I do one hot encoding/label encoding to the second column?
How to make use of the date-time feature in the clustering?
Things I have tried -
I did preprocess and feature engineering of column A(convert to binary), B(label encoding), C(Converted to year and month feature from dates) and D(ignored this feature as did not know how could I use it).
Based on this, I got clusters using kmeans.labels_
, but those clusters are numeric 1,2,3,4.
How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]
?
How can I use the text column efficiently to make the clusters?
Just short answers to my query would do. I don't need any implementation.