Im working on an NLP project whose target variable contains seven unique sentences which are "inspirational and thought-provoking ", "informative", "acknowledgment and appreciations" and 4 others. As for my understanding, the target variable as we can't establish a quantitative comparison between them. So my question is what is the best way to encode such variables? And if I encode it using one Hot-encoding then the problem will be of multi-class classification?
1 Answers
In classification it does not matter what the class actually represents, the learning algorithm treats every class as categorical anyway. In other words whether the names of the classes are strings, characters or numbers does not change anything to the model. This is why the most common choice is to simply represent the classes as integers: 1,2,3,... For example in scikit this can be done with LabelEncoder.
It would be a bad idea to use one hot encoding because this would make the problem multi-label. This would make the problem much more complex for the model and would very likely lead to lower performance, or it would require much more data in order to reach the same performance as regular classification. This is because there are much more combinations possible in the multi-label problem, and in this case this higher level of complexity is pointless since there can be only one class.

- 1,385
- 1
- 12
- 22