
I have built a random forest for multiclass text classification. The model returned an accuracy of 75%. There are 6 labels; however, only 3 of the 6 classes are ever predicted, and the rest are never assigned. I would really appreciate it if anyone could let me know what went wrong.

Below are the steps I followed.

DATA PREPARATION

  • Create a word vector from the description field.

  • Build a corpus using the word vector.

  • Pre-processing tasks such as removing numbers, extra whitespace, and
    stopwords, and converting to lower case.

  • Build a document term matrix (dtm).
  • Remove sparse words from the above dtm.

  • The above step yields a count frequency matrix showing the frequency of each word in its corresponding column.

  • Transform the count frequency matrix into a binary incidence matrix, which records the occurrence of a word in a document as either 0 or 1: 1 if present and 0 if absent.

  • Append the label column from the original dataset to the transformed dtm. The label column has 6 labels.
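The data-preparation steps above can be sketched with the tm package. This is a minimal sketch, assuming a data frame `df` with `description` and `label` columns (both names are assumptions, not from the original post):

```r
library(tm)

# Hypothetical input: data frame `df` with `description` and `label` columns
corpus <- VCorpus(VectorSource(df$description))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)   # drop very sparse terms

# Binary incidence matrix: 1 if the word occurs in the document, 0 otherwise
m <- as.matrix(dtm)
m[m > 0] <- 1

# Append the label column to the transformed dtm
dat <- data.frame(m, label = factor(df$label))
```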

MODEL BUILDING

  • Randomly sample the dtm and split it into a training set and a testing set.
  • Build a base model of random forest with 7-fold cross validation.
  • Check for accuracy of the model on the training set and testing set.
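As a sketch, the split and 7-fold cross-validated random forest could look like this with caret (assuming the `dat` data frame with a factor `label` column described above; the names are assumptions):

```r
library(caret)

set.seed(42)
idx   <- createDataPartition(dat$label, p = 0.7, list = FALSE)
train <- dat[idx, ]
test  <- dat[-idx, ]

ctrl <- trainControl(method = "cv", number = 7)   # 7-fold cross validation
fit  <- train(label ~ ., data = train, method = "rf", trControl = ctrl)

pred <- predict(fit, newdata = test)
confusionMatrix(pred, test$label)   # per-class results, not just overall accuracy
```

The confusion matrix is the important part here: overall accuracy can look fine even when entire classes are never predicted.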

    I am sharing the link to the results (if it is allowed here).

    http://rpubs.com/shanmukha_karthik/346007

Karthik Shanmukha
  • 1. Could you please show us a frequency table of the label distribution in the training set? If the other 3 labels have a small proportion (i.e. unbalanced data), your training set may contain very few (or none) of those labels. 2. I haven't used rpart for multiclass problems, but you haven't specified the depth of the trees, so the model may not split enough for all 6 labels. 3. Could you show us the model summary so we can check the (average) depth of the trees? – Sixiang.Hu Jan 03 '18 at 10:48
  • @Sixiang.Hu I have added the frequency table of the label distribution in the training set. Check this [link](http://rpubs.com/shanmukha_karthik/346007) – Karthik Shanmukha Jan 04 '18 at 03:44
  • TBH you might want to try using something different than a Random Forest - your dtm will be a sparse matrix, since in any document, most words in the corpus don't occur. Usually SVM or MNB work better https://www.aclweb.org/anthology/P12-2018 – TMrtSmith Feb 23 '18 at 09:58
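The label-distribution check suggested in the first comment is a one-liner in R (assuming a `train` data frame with a `label` column, as sketched earlier):

```r
table(train$label)                         # absolute counts per class
round(prop.table(table(train$label)), 3)   # proportion of each class
```

If some classes show near-zero proportions here, the forest has almost nothing to learn them from.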

1 Answer


There could be many possibilities to increase the accuracy:

1. Try to increase the size of the classes that have fewer than 1000 instances.
2. Try multiple removeSparseTerms thresholds, such as 0.99, 0.991, 0.999, etc., and check your accuracy for each.
3. Use stemming; it reduces words to their root form.
4. You are only using term frequency (TF) when creating your dtm. Try a TF-IDF weighting as well by simply adding:

dtm <- DocumentTermMatrix(corpus,
       control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                      stopwords = TRUE))
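For point 3, stemming can be applied as one more tm_map step in the preprocessing pipeline; a minimal sketch (tm's stemDocument requires the SnowballC package):

```r
library(tm)
library(SnowballC)   # stemming backend used by tm

corpus <- tm_map(corpus, stemDocument)   # e.g. "running", "runs" -> "run"
# rebuild the DocumentTermMatrix afterwards so the stemmed terms are used
```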

5. Try another package; for example, use the ranger learner from mlr to train the random forest.
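For point 5, a minimal sketch calling the ranger package directly (assuming `train` and `test` data frames with a factor `label` column, as in the question's setup; the names are assumptions):

```r
library(ranger)

fit  <- ranger(label ~ ., data = train, num.trees = 500)
pred <- predict(fit, data = test)

table(pred$predictions, test$label)   # confusion matrix across all 6 classes
```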

I hope it works for you.