I have built a random forest for multiclass text classification. The model returned an accuracy of 75%. There are 6 labels; however, only 3 of the 6 classes are ever predicted, and the other 3 are never assigned to any document. I would really appreciate it if anyone could let me know what went wrong.
Below are the steps I followed.
DATA PREPARATION
- Create a word vector for the description.
- Build a corpus using the word vector.
- Pre-processing tasks such as removing numbers, whitespace and stopwords, and conversion to lower case.
- Build a document term matrix (dtm).
- Remove sparse words from the above dtm.
- The above step leads to a count frequency matrix showing the frequency of each word in its corresponding column.
- Transform the count frequency matrix to a binary instance matrix, which shows the occurrence of a word in a document as either 0 or 1: 1 for present and 0 for absent.
- Append the label column from the original dataset to the transformed dtm. The label column has 6 labels.
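To make the preparation steps concrete, here is a minimal sketch in plain Python. It is not my actual code: the stopword list, the example documents, the labels, the 0.5 sparsity threshold and the function names (`tokenize`, `count_dtm`, `binarize`) are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}

def tokenize(text):
    """Lower-case, drop numbers, split on non-letters, remove stopwords."""
    text = re.sub(r"\d+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS]

def count_dtm(docs, min_doc_fraction):
    """Return (terms, rows) where rows[i][j] counts terms[j] in docs[i].

    Terms that occur in fewer than min_doc_fraction of the documents
    are treated as sparse and dropped.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for c in counts for t in c)
    min_docs = min_doc_fraction * len(docs)
    terms = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

def binarize(rows):
    """Turn counts into 0/1 presence indicators."""
    return [[1 if v > 0 else 0 for v in row] for row in rows]

# Made-up documents and labels, for illustration only.
docs = [
    "The printer is out of toner 42",
    "Printer jams and printer toner leaks",
    "Reset the password of the account",
]
labels = ["hardware", "hardware", "access"]

terms, count_rows = count_dtm(docs, min_doc_fraction=0.5)
binary_rows = binarize(count_rows)
# Append the label as the last column of each row.
dataset = [row + [label] for row, label in zip(binary_rows, labels)]
```

With these toy inputs only `printer` and `toner` survive the sparsity filter, so the third document ends up as an all-zero row. Rows like that carry no signal for the model, which may be worth checking for in the real dtm.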
MODEL BUILDING
- Randomly sample the dtm and split it into a training set and a testing set.
- Build a base random forest model with 7-fold cross-validation.
- Check the accuracy of the model on the training set and the testing set.
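Since overall accuracy does not show which classes are being missed, a per-class check along the following lines might help. This is only a sketch in plain Python: the labels `A` to `F` and the `y_true` / `y_pred` vectors are invented so that the numbers resemble my situation (75% accuracy, 3 classes never predicted); they are not my real results.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of items of class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        recall[label] = matrix[i][i] / total if total else float("nan")
    return recall

# Invented example: 20 test documents, 6 classes, imbalanced.
labels = ["A", "B", "C", "D", "E", "F"]
y_true = ["A"] * 6 + ["B"] * 5 + ["C"] * 4 + ["D"] * 2 + ["E"] * 2 + ["F"] * 1
y_pred = ["A"] * 6 + ["B"] * 5 + ["C"] * 4 + ["A"] * 2 + ["B"] * 2 + ["C"] * 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix, labels)
class_counts = Counter(y_true)          # how many test items per class
predicted = set(y_pred)
never_predicted = [l for l in labels if l not in predicted]

print("accuracy:", accuracy)
print("class counts:", dict(class_counts))
print("recall per class:", recall)
print("never predicted:", never_predicted)
```

In this invented example the accuracy is 0.75 even though `D`, `E` and `F` have a recall of 0, because those classes are rare. Comparing the class counts with the per-class recall on the real data would show whether something similar is going on.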
I am sharing the link to the results (if that is allowed here).