
I have been working on a machine learning project with tweets, including a classification problem. As a consequence, I have a training set and a testing set of tweets.

On the training set, I computed a TF-IDF matrix with the "tm" R package:

library(tm)
text_matrix <- DocumentTermMatrix(myCorpus_2, 
                 control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

Now I want to build a similar document-term matrix for my test dataset, with the same words as columns.

I do not know how to generate a TF-IDF matrix while specifying the list of columns I want. Does anyone know how I could do this?

EDIT: Actually, I am looking for an equivalent of sklearn.feature_extraction.text.TfidfVectorizer in R.

aprevel
  • 1
    This has already been solved and should be marked as a duplicate. Have a look, for instance, at the detailed solution in http://stackoverflow.com/questions/16630627/how-to-recreate-same-documenttermmatrix-with-new-test-data – Eric Lecoutre Feb 06 '17 at 15:42
  • 2
    Possible duplicate of [How to recreate same DocumentTermMatrix with new (test) data](http://stackoverflow.com/questions/16630627/how-to-recreate-same-documenttermmatrix-with-new-test-data) – emilliman5 Feb 06 '17 at 16:51
  • Make the TDM for your test data then subset the TDM based on which columns are in your training TDM. Something akin to `colnames(test) %in% colnames(train)` – emilliman5 Feb 06 '17 at 16:53
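
The approach suggested in the comments can be sketched roughly as follows. This is only an illustration, not a confirmed solution from the thread: the toy texts and all variable names are hypothetical, and it assumes that passing `dictionary = Terms(...)` in the `control` list restricts the test matrix to the training vocabulary. It also makes one design choice the comments do not mention, namely reusing the IDF weights computed on the training documents instead of recomputing them on the test set. The weights use the same unnormalized `tf * log2(N / df)` form as `weightTfIdf(x, normalize = FALSE)`, which is not the same formula as sklearn's `TfidfVectorizer`.

```r
library(tm)

# Hypothetical toy data standing in for the real training and test tweets
train_text <- c("good movie tonight", "bad movie", "good weather today")
test_text  <- c("good film", "bad weather tonight")

train_corpus <- VCorpus(VectorSource(train_text))
test_corpus  <- VCorpus(VectorSource(test_text))

# Raw term counts on the training set (default term-frequency weighting)
train_tf <- DocumentTermMatrix(train_corpus)

# Test counts restricted to the training vocabulary; unseen words are dropped
test_tf <- DocumentTermMatrix(test_corpus,
                              control = list(dictionary = Terms(train_tf)))

# Dense matrices, with the test columns put in the training column order
train_counts <- as.matrix(train_tf)
test_counts  <- as.matrix(test_tf)[, Terms(train_tf), drop = FALSE]

# IDF estimated from the training documents only
idf <- log2(nrow(train_counts) / colSums(train_counts > 0))

# Multiply each column by its training IDF weight
train_tfidf <- sweep(train_counts, 2, idf, `*`)
test_tfidf  <- sweep(test_counts,  2, idf, `*`)
```

With this setup `test_tfidf` has exactly the columns of `train_tfidf`, so a model fitted on the training matrix can be applied to it directly. For a large corpus the `as.matrix()` conversion may be too memory-hungry, and a sparse representation would be preferable.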

0 Answers