Q-1. How to change data of a corpus to appropriate format for training with 'caret' package?

First of all, I would like to describe the environment for this question, and then I will show you where I am stuck.

Environments

This is the corpus, called rt (R code):

require(tm)
require(tm.corpus.Reuters21578) # to load data
data(Reuters21578)
rt<-Reuters21578

And the training document-term matrix, called dtmTrain, is created from the training corpus rtTrain (R code):

dtmTrain <- DocumentTermMatrix(rtTrain)

I have 10 classes in total for this project. The classes are in the metadata of each document:

c("earn","acq","money-fx","grain","crude","trade","interest","ship","wheat","corn")

I have created a data frame from rt with dimensions (documents x classes). It is called docLabels:

Docs earn acq money-fx grain crude trade interest ship wheat corn
   1    0   0        0     0     0     0        0    0     0    0
   2    0   0        0     0     0     0        0    0     0    0
   3    0   0        0     0     0     0        0    0     0    0
   4    0   0        0     0     0     0        0    0     0    0
   5    0   0        0     1     0     0        0    0     1    1
   6    0   0        0     1     0     0        0    0     1    1
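For context, here is a hedged sketch of how a frame like docLabels could be built from per-document topic labels. The doc_topics list is a hard-coded stand-in for whatever you would actually pull from each document's metadata (e.g. via meta()); only three documents are shown.

```r
# The 10 classes of interest
classes <- c("earn","acq","money-fx","grain","crude","trade",
             "interest","ship","wheat","corn")

# Stand-in for per-document topic metadata: docs 1-2 have no topics,
# doc 3 is tagged grain/wheat/corn
doc_topics <- list(character(0),
                   character(0),
                   c("grain","wheat","corn"))

# One 0/1 indicator row per document, one column per class
docLabels <- as.data.frame(t(sapply(doc_topics, function(tp)
  as.integer(classes %in% tp))))
names(docLabels) <- classes

docLabels$grain  # 0 0 1
```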

I assume that everything is clear so far.

Problem

I have a document-term matrix which holds the data and a data frame which holds the classes, as you can see. How can I merge these two data objects into a format suitable for training with the 'caret' package?
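To make the intended merge concrete, here is a minimal sketch using small stand-in objects (the real ones would be as.matrix(dtmTrain) and one class column of docLabels): bind the features and the outcome into a single data frame, with the outcome as a factor so caret treats the problem as classification.

```r
# Stand-in for as.matrix(dtmTrain): 3 documents x 2 terms
dtm_mat <- matrix(c(1, 0, 2, 0, 3, 1), nrow = 3,
                  dimnames = list(NULL, c("oil", "price")))

# Stand-in for one class column of docLabels
grain <- c(0, 0, 1)

# Merge features and outcome into one data frame; a factor outcome
# lets caret::train(grain ~ ., data = train_df) run classification
train_df <- data.frame(dtm_mat, grain = factor(grain))
```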

Q-2. How to train multiclass data with 'caret' package?

Once the data is in the appropriate format, how do I train it with the caret package?

This is from caret package documentation.

## S3 method for class 'formula'
train(form, data, ..., weights, subset, na.action, contrasts = NULL)

So, what should form be?

milos.ai

1 Answer


Since you are working with matrices, you should consider the default method for caret::train rather than the formula interface. Note under ?train that you can pass arguments such as:

x: an object where samples are in rows and features are in columns. This could be a simple matrix...

y: a numeric or factor vector containing the outcome for each sample.

This will be simpler than building a formula. So let's discuss how to obtain x and y.

Getting x: We want to pass caret::train an x matrix with only those terms we want to use in the model. So we have to narrow the DocumentTermMatrix, which is a sparse matrix, down to those terms:

# You need to tell people where to find the file so your example is reproducible
install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
rt <- Reuters21578

dtm <- DocumentTermMatrix(rt)

# these are the terms you care about
your_terms <- c("earn","acq","money-fx","grain","crude","trade",
                "interest","ship","wheat","corn")

your_columns <- which(tolower(dtm$dimnames$Terms) %in% your_terms) # only 8 are found

your_dtm <- as.matrix(dtm[,your_columns]) # unpack selected columns of sparse matrix

Getting y: Your question is not at all clear in terms of what your dependent variable is -- the thing you are trying to predict. For this answer I will show you how to predict whether the document includes one or more uses of the word "debt." If one of the classes in your_terms is actually your dependent variable, then remove it from your_terms and use it instead of "debt" in this example:

your_target <- as.integer(as.matrix(dtm[,'debt'])[,1] > 0) # returns array

Training a model in caret.

First we will split the target vector and the explanatory matrix into 60/40 train/test sets.

library('caret')
set.seed(123)
train_rows <- createDataPartition(your_target, p=0.6, # for 60% training set
                                  list=FALSE)         # return row indices, not a list

dtm_train <- your_dtm[train_rows,] 
y_train <- your_target[train_rows] 

dtm_test <- your_dtm[-train_rows,] 
y_test <- your_target[-train_rows] 

Now you need to decide what kind of model(s) you want to try. For our example, we will use a lasso/ridge regression glmnet model. You should also try tree-based approaches such as rf or gbm.

Using the parallel backend is not strictly necessary but will speed up large jobs. Feel free to try this example without it.

tr_ctrl <- trainControl(method='repeatedcv', number=8, # train using 8-fold CV w/ 3 reps
                        repeats=3, returnResamp='none')
library(parallel)
library(doParallel) # on Linux/OSX you can use library(doMC) instead
use_cores <- detectCores()-1
cl <- makeCluster(use_cores)
registerDoParallel(cl) # with doMC, use registerDoMC(use_cores) instead
set.seed(123)
glm <- train(x = dtm_train, y = y_train,           # You can ignore the warning about
             method='glmnet', trControl = tr_ctrl) #  classification vs. regression.
stopCluster(cl)

Of course there is a lot more tuning you could do here.
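For instance, the tuning could be made explicit by passing a tuneGrid; the alpha/lambda values below are hypothetical, not recommendations:

```r
# glmnet's two tuning parameters: alpha mixes ridge (0) and lasso (1);
# lambda sets the overall regularization strength
tune_grid <- expand.grid(alpha  = c(0, 0.5, 1),
                         lambda = 10^seq(-4, -1, length.out = 4))

nrow(tune_grid)  # 12 candidate models
# then pass tuneGrid = tune_grid to train()
```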

Testing the model. You can use AUC here.

library('pROC')
auc_train <- roc(y_train, 
                 predict(glm, newdata = dtm_train, type='raw') )
auc_test <- roc(y_test, 
                 predict(glm, newdata = dtm_test, type='raw') )
writeLines(paste('AUC using glm:', round(auc_train$auc,4),'on training/validation set',
                 round(auc_test$auc,4),'on test set.'))

Running this I get AUC using glm: 0.6389 on training/validation set 0.6552 on test set. So make sure to try other models and see if you can improve performance.

C8H10N4O2
  • I forgot to put `family="binomial"` in the example `train()`. You would probably get a better AUC by adding that parameter & treating this as a logistic rather than linear regression. – C8H10N4O2 Feb 06 '16 at 16:02
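To illustrate the point of that comment with a self-contained base-R sketch (no caret; the data here is simulated): a binomial fit models P(y = 1) directly, so its fitted values stay inside [0, 1], unlike a plain linear fit on 0/1 labels.

```r
set.seed(1)
x <- rnorm(100)
y <- as.integer(x + rnorm(100) > 0)  # simulated 0/1 outcome

lin_fit <- lm(y ~ x)                      # linear regression on 0/1 labels
log_fit <- glm(y ~ x, family = binomial)  # logistic regression

p <- predict(log_fit, type = "response")  # fitted probabilities
range(p)  # always within [0, 1]
```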