
I came across the text2vec package today and it's exactly what I need for a particular problem. However, I haven't been able to figure out how to export a document-term matrix (DTM) created with text2vec to an output file. My ultimate goal is to generate features in R using text2vec and import the resulting matrices into H2O for further modeling. H2O can read either CSV or SVMLight format.

The first one I've created is a 987753 x 8806 sparse Matrix of class "dgCMatrix" with 3625049 entries, so it's pretty big. It's too big to convert with as.matrix() and write out as CSV. I thought I could easily write it out in SVMLight format, but I haven't been able to find a library that works. Does anyone have other options for getting this output into a file that I can read into H2O?

Dmitriy Selivanov
Dave Kincaid

1 Answer


There are several packages that can do that. Take a look at https://github.com/Laurae2/sparsity, which is the most promising in my opinion:

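library(text2vec)
library(sparsity)
library(magrittr)  # provides %>% in case text2vec does not re-export it
data("movie_review")
N = 5000
# lowercase and tokenize the first N reviews
tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
it = itoken(tokens, progressbar = TRUE)
dtm = create_dtm(it, hash_vectorizer())
# labelVector must be numeric, with one entry per row of dtm
write.svmlight(dtm, labelVector = as.numeric(movie_review$sentiment[1:N]), file = "dtm.svmlight")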
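
If no package works in your environment, the SVMLight text format is simple enough to write by hand from the sparse triplets, without ever calling as.matrix(). The following is only a minimal sketch using the Matrix package: `write_svmlight_sketch` is a hypothetical helper name, not part of text2vec or sparsity, and it assumes `x` is a "dgCMatrix" and `y` is a numeric label vector with one entry per row. It has not been tuned for a matrix with close to a million rows.

```r
library(Matrix)

# Hypothetical helper (not from text2vec or sparsity): writes a dgCMatrix
# to SVMLight text using only its stored (row, column, value) triplets.
write_svmlight_sketch <- function(x, y, file) {
  stopifnot(nrow(x) == length(y), is.numeric(y))
  trip <- summary(x)               # 1-based triplets: i (row), j (column), x (value)
  o <- order(trip$i, trip$j)       # group by row, ascending feature index within a row
  tokens <- paste0(trip$j[o], ":", trip$x[o])
  rows <- factor(trip$i[o], levels = seq_len(nrow(x)))
  feats <- tapply(tokens, rows, paste, collapse = " ")
  feats[is.na(feats)] <- ""        # rows with no stored entries get a label only
  writeLines(trimws(paste(y, feats)), file)
}

# Tiny usage example on a 3 x 4 matrix whose second row is empty.
m <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1), x = c(1, 2, 5), dims = c(3, 4))
out <- tempfile(fileext = ".svmlight")
write_svmlight_sketch(m, y = c(1, 0, 1), file = out)
readLines(out)
```

Each output line is the label followed by `index:value` pairs with 1-based feature indices, which is the layout SVMLight readers generally expect.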
Dmitriy Selivanov
  • Thanks. I came across a few of them (including this one) and none of them work. They all throw some kind of error or another. – Dave Kincaid Nov 27 '16 at 18:17
  • @dave-kincaid everything works fine; see the updated answer with an example. I found the issue you reported: https://github.com/felixr/sparsity/issues/1. The problem is that `labelVector` must be a numeric target variable. – Dmitriy Selivanov Nov 27 '16 at 18:53
  • Ah, yes! Thank you very much. It is working great now! – Dave Kincaid Nov 27 '16 at 19:11