
I'm working on a multi-class text classification project and I need to build the document/term matrices and train and test models in R.

I already have datasets that exceed the dimensionality limits of R's base matrix class, so I would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, which has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even with small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I then transform into a data.frame to perform the training.
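For reference, here is a minimal sketch of that pipeline (`my_tweets` is a placeholder for my tweet vector; function names follow the current quanteda API):

```r
library(quanteda)

# Tokenize and build the sparse document-feature matrix (dfm)
toks <- tokens(my_tweets, remove_punct = TRUE)
dfm_tweets <- dfm(toks)

# This is the step that hurts: the sparse dfm becomes a dense
# data.frame with one column per term
train_df <- convert(dfm_tweets, to = "data.frame")
```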

I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which provides this kind of container, but I am not sure it will work with caret for the later classification. Overall I want to understand the problem and find a workaround so I can work with bigger datasets. RAM is not a (big) problem (32 GB), but I feel completely lost about how to approach this.


1 Answer


At what point did you hit the RAM constraints?

quanteda is a good package for NLP work on medium-sized datasets. But I also suggest trying my text2vec package. It is generally much more memory-friendly and doesn't require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).

Second, I strongly recommend not converting your data into a data.frame. Work with sparseMatrix objects directly, as in the sketch below.
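For example, a minimal sketch with text2vec (`my_tweets` and `my_ids` are placeholders for your data):

```r
library(text2vec)

# Iterator over tokens; for corpora that don't fit in RAM you can
# iterate over files instead of an in-memory character vector
it <- itoken(my_tweets, preprocessor = tolower,
             tokenizer = word_tokenizer, ids = my_ids)

# Build the vocabulary and prune very rare terms
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)

# The DTM comes back as a sparse dgCMatrix -- keep it that way
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
class(dtm)  # "dgCMatrix"
```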

The following methods work well for text classification:

  1. Logistic regression with an L1 penalty (see the glmnet package; a sketch follows this list)
  2. A linear SVM (see LiblineaR, though it is worth searching for alternatives)
  3. `xgboost` is also worth trying. I would prefer linear models, so you can try its linear booster.
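A minimal sketch for option 1, assuming a sparse `dtm` as built above and a factor of labels `my_labels` (placeholder names; `dtm_test` would be a test DTM built with the same vectorizer):

```r
library(glmnet)

# glmnet accepts a sparse dgCMatrix directly, so nothing is densified.
# alpha = 1 gives the L1 (lasso) penalty; family = "multinomial"
# handles the multi-class case
cv_fit <- cv.glmnet(x = dtm, y = my_labels,
                    family = "multinomial", alpha = 1, nfolds = 5)

# Predict class labels at the lambda chosen by cross-validation
preds <- predict(cv_fit, newx = dtm_test, s = "lambda.min", type = "class")
```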
  • (took too long to edit that answer...) I have been taking a look at the text2vec package and it sounds interesting. I am considering giving it a try, but I need to understand how I can then use the created DTM to perform the classification - I need to be using caret - since, as I understand it, this object is a dgCMatrix or dgTMatrix. Can I feed this object directly to the train function in caret? Thanks! – Ed. Aug 04 '16 at 09:12
  • `dgCMatrix` from the `Matrix` package is the "standard" for sparse matrices in R. I haven't tried `caret`, but you may be interested in this topic: https://github.com/topepo/caret/issues/31 . It seems caret supports sparse matrices out of the box. – Dmitriy Selivanov Aug 04 '16 at 09:58
  • In fact, `quanteda`'s `dfm-class` inherits from `dgCMatrix-class`, so if your code works with `dfm-class`, in most cases it will work with `dgCMatrix` as well. – Dmitriy Selivanov Aug 04 '16 at 10:06
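To illustrate that coercion (a sketch; `dfm_tweets` and `my_labels` are placeholders, and whether `caret::train` accepts a sparse `x` depends on the model and caret version, per the issue linked above):

```r
library(quanteda)
library(Matrix)

# A dfm already inherits from dgCMatrix; the coercion just makes it explicit
x_sparse <- as(dfm_tweets, "dgCMatrix")

# Per the caret issue linked above, train() can take sparse input
# for some models, e.g. method = "glmnet"
fit <- caret::train(x = x_sparse, y = my_labels, method = "glmnet")
```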
  • Thanks! I'm working on it right now. Crossing fingers! – Ed. Aug 04 '16 at 10:13
  • I get an error with caret and I can't understand why it is complaining, as I find little info around. The error is: Error in { : task 2 failed - "'n' must be a positive integer >= 'x'", and it happens when I launch the train function. Should I open a new question with some source code for this? – Ed. Aug 04 '16 at 10:46
  • I think so. But please provide a simple reproducible example with the full pipeline. – Dmitriy Selivanov Aug 04 '16 at 10:53
  • I'm afraid it was a problem of severe class imbalance in the dataset, so for now I will close the question. Thanks a lot, Dmitriy! – Ed. Aug 04 '16 at 10:57
  • Also, please use the dev version of text2vec from GitHub. I recommend 0.4 from this branch: https://github.com/dselivanov/text2vec/tree/0.4 – Dmitriy Selivanov Aug 04 '16 at 11:03