This is a pretty dumb question, but I couldn't find an answer anywhere, so I'll take my chances here...
I'm building a classifier using CatBoost. Since this is an NLP problem, my features are the words/tokens in each tweet and the target is its class label. Basically, I have something like this:
tweet                        target
I was looking at her...      happy
It's really hot today        mad
Last Friday night was...     sad
...                          ...
Due to company compliance, I can't share the dataset, but I guess you guys will understand and can even try using another dataframe (this one is very similar: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification). The target has 5 classes and the dataset is imbalanced. The class proportions are:
happy: 0.80
neutral: 0.11
mad: 0.080
sad: 0.005
confused: 0.005
So, after splitting into training and test, stratified by the target, I was using this pipeline:
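from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tokenizer", CountVectorizer(
        analyzer="word",
        ngram_range=(1, 2),
        token_pattern=r"\w+",
        stop_words="english",
    )),
    # chi2 is an assumed score function; SelectKBest needs a scoring callable, not SelectKBest itself
    ("feature_selection", SelectKBest(chi2, k=90)),
    ("clf", CatBoostClassifier()),
])
pipe.fit(X_train, y_train)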
Since the dataset is imbalanced, how can I use class_weights here? I saw a tutorial doing something similar to this:
CatBoostClassifier(class_weights=[1-0.8, 1-0.11, 1-0.08, 1-0.005, 1-0.005])
But how do I know which order the weights should be in?
I also tried keying the weights by class name, like class_weights={'happy': 1-0.8...}, but that didn't work either.
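To make the ordering problem concrete, here is a minimal sketch of what I'm trying to express. It only uses the class shares listed above and the same "1 - share" heuristic from the tutorial line. The names `proportions`, `weights_by_label`, and `weights_in_sorted_order` are made up for illustration, and the idea that a positional list would follow the sorted label order is my assumption, not something I have confirmed.

```python
# Class shares as listed in the question.
proportions = {
    "happy": 0.80,
    "neutral": 0.11,
    "mad": 0.08,
    "sad": 0.005,
    "confused": 0.005,
}

# Same "1 - share" heuristic as the tutorial line, but keyed by label
# so each weight stays attached to its class name.
weights_by_label = {label: round(1 - p, 3) for label, p in proportions.items()}

# ASSUMPTION: a positional list would follow the alphabetically sorted labels.
# This is a guess to be checked against the fitted model, not a documented fact.
sorted_labels = sorted(weights_by_label)
weights_in_sorted_order = [weights_by_label[label] for label in sorted_labels]

print(sorted_labels)
print(weights_in_sorted_order)
```

What I can't tell is whether CatBoost actually uses this sorted order for a positional list. Comparing it against the `classes_` attribute of the fitted model might be one way to check.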