
I am trying to implement a very simple ML problem, where I use text to predict some outcome. In R, a basic example would be:

1. Import some fake but funny text data:

library(caret)
library(dplyr)
library(text2vec)

dataframe <- data_frame(id = c(1, 2, 3, 4),
                        text = c("this is a this", "this is another",
                                 "hello", "what???"),
                        value = c(200, 400, 120, 300),
                        output = c("win", "lose", "win", "lose"))

> dataframe
# A tibble: 4 x 4
     id            text value output
  <dbl>           <chr> <dbl>  <chr>
1     1  this is a this   200    win
2     2 this is another   400   lose
3     3           hello   120    win
4     4         what???   300   lose

2. Use text2vec to get a sparse matrix representation of my text (see also https://github.com/dselivanov/text2vec/blob/master/vignettes/text-vectorization.Rmd):

#these are text2vec functions to tokenize and lowercase the text
prep_fun = tolower
tok_fun = word_tokenizer 

#create the tokens
train_tokens = dataframe$text %>% 
  prep_fun %>% 
  tok_fun

it_train = itoken(train_tokens)     
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

> dtm_train
4 x 6 sparse Matrix of class "dgCMatrix"
  what hello another a is this
1    .     .       . 1  1    2
2    .     .       1 .  1    1
3    .     1       . .  .    .
4    1     .       . .  .    .

3. Finally, train a model (for instance with caret) to predict output from the sparse matrix:

mymodel <- train(x=dtm_train, y =dataframe$output, method="xgbTree")

> confusionMatrix(mymodel)
Bootstrapped (25 reps) Confusion Matrix 

(entries are percentual average cell counts across resamples)

          Reference
Prediction lose  win
      lose 17.6 44.1
      win  29.4  8.8

 Accuracy (average) : 0.264

My problem is:

I see how to import data into H2O using spark_read_csv, rsparkling and as_h2o_frame. However, for steps 2 and 3 above I am completely lost.

Can someone please give me some hints, or tell me whether this approach is even possible with H2O?

Many thanks!!

Dmitriy Selivanov
ℕʘʘḆḽḘ
1 Answer
You can solve this in one of two ways: (1) do the text processing in R first and then move to H2O for modeling, or (2) work entirely in H2O using H2O's word2vec implementation.

Option 1: Use R data.frames and text2vec, then convert the sparse matrix to an H2OFrame and do the modeling in H2O.

# Use the same code as above to get to this point, then:

# Convert the dgCMatrix to an H2OFrame and cbind the response column
train <- as.h2o(dtm_train)
train$y <- as.h2o(dataframe$output)

# Train any H2O model (e.g. GBM)
mymodel <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1)
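If it helps, scoring new documents with this setup would look roughly like the sketch below. It assumes an H2O cluster is already running via `h2o.init()`, and the `new_text` vector is a hypothetical example; the key point is to reuse the same `vectorizer` fitted on the training data so the columns line up:

```r
library(text2vec)
library(h2o)

# Hypothetical new documents to score
new_text <- c("this is new", "hello again")

# Vectorize with the SAME prep/tokenizer/vectorizer fitted on the training data
it_new  <- itoken(tok_fun(prep_fun(new_text)))
dtm_new <- create_dtm(it_new, vectorizer)

# Convert and predict with the fitted H2O model
preds <- h2o.predict(mymodel, as.h2o(dtm_new))
```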

Option 2: Train a word2vec embedding in H2O and apply it to your text to get the equivalent of the sparse matrix above, then train an H2O machine learning model (e.g. a GBM). I will try to edit this answer later with a working example using your data, but in the meantime here is an example demonstrating the use of H2O's word2vec functionality in R.
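A minimal sketch of that word2vec route on the toy data might look like this. It assumes a running H2O cluster; `h2o.tokenize`, `h2o.word2vec` and `h2o.transform` are the h2o R API names for this workflow, but the parameter values here (`vec_size`, `epochs`, the `"\\W+"` split pattern) are purely illustrative, not tuned:

```r
library(h2o)
h2o.init()

# Bring the text over and make sure it is a string (not enum) column
text_hf <- as.h2o(dataframe["text"])
words   <- h2o.tokenize(as.character(text_hf$text), "\\W+")

# Fit a tiny word2vec model (tiny only because the toy data has a handful of words)
w2v <- h2o.word2vec(words, vec_size = 5, min_word_freq = 1, epochs = 5)

# Average the word vectors per document -- the dense analogue of a DTM row
doc_vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")

# cbind the response and train a GBM as before
train   <- h2o.cbind(doc_vecs, as.h2o(dataframe["output"]))
mymodel <- h2o.gbm(y = "output", training_frame = train, seed = 1)
```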

Erin LeDell
  • that's really cool, @ErinLeDell. Looking forward to your working example! Thanks – ℕʘʘḆḽḘ Jun 15 '17 at 12:08
  • But where did you find something about word2vec in the question? text2vec != word2vec. The question is about how to export a sparse matrix to H2O! And the way to do it is to convert the matrix to svmlight format. – Dmitriy Selivanov Jun 16 '17 at 05:18
  • The task Noobie asked about was the H2O-only equivalent of training a model on text -- I mean you can perform the same task (using word2vec) in H2O. The solution above lets you keep using text2vec, but it also requires performing the text-processing computation in R memory (rather than in distributed H2O w2v), so I suggested H2O w2v as a work-around. – Erin LeDell Jun 18 '17 at 22:31
  • @DmitriySelivanov exactly. In R, I stick to your nice package. In Spark, I hope the great ErinLeDell will be able to show a 100% H2O example using my small toy model :D Thanks to both you guys – ℕʘʘḆḽḘ Jun 19 '17 at 00:48
  • @erin-ledell still it is different. H2O allows you to obtain vectors for words or for sentences (averaged word vectors). But the example here was about a bag-of-words model and a large sparse matrix. BTW I took a look at how H2O coerces a sparse matrix to its internal format (https://github.com/h2oai/h2o-3/blob/master/h2o-r/h2o-package/R/frame.R#L3219-L3251) - it is a nightmare. I will send a PR. – Dmitriy Selivanov Jun 19 '17 at 10:12
  • @DmitriySelivanov I am being the promoter of increasing synergies between `text2vec` and `h2o`! Dmitriy, what do you mean it is a nightmare? That it would not work with large sparse matrices? – ℕʘʘḆḽḘ Jun 19 '17 at 13:50
  • 1) it is just wrong - it silently and implicitly treats the first column as the target variable 2) it will take very (very) long even for small data sets – Dmitriy Selivanov Jun 19 '17 at 17:28
  • @DmitriySelivanov were you able to change the code in H2O then? – ℕʘʘḆḽḘ Jun 20 '17 at 19:28
  • @noobie, for example use https://github.com/dselivanov/sparsio to write the matrix as svmlight. Then you can read it with `h2o.uploadFile`. – Dmitriy Selivanov Jun 23 '17 at 13:23
  • @DmitriySelivanov The code is not wrong. We use SVMLight to transfer sparse data, and the first column of an SVMLight file is the target/response variable. The implication of using it for a generic matrix-like structure is that the first column always has a dense representation while the rest of the matrix is sparse. – Erin LeDell Jun 23 '17 at 16:41
  • Of course it is. Even in the example above the first column is not the target, but H2O converts it to the target and then overrides it with the actual target. – Dmitriy Selivanov Jun 23 '17 at 18:20
  • @DmitriySelivanov, Erin LeDell, please don't fight, cooperate! Just answer my original question! :D – ℕʘʘḆḽḘ Jun 24 '17 at 13:36
  • @ErinLeDell accepting while hoping for a pure `h2o` small example as well :D – ℕʘʘḆḽḘ Jun 26 '17 at 19:23
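For reference, the svmlight route Dmitriy suggests in the comments could be sketched as below. The file path is hypothetical, the argument names follow the sparsio package, and it assumes a running H2O cluster; note that the response must be encoded numerically since svmlight stores the target in the first column:

```r
library(sparsio)
library(h2o)

# Encode the response as 0/1 -- the svmlight format stores it as the first column
y <- as.integer(dataframe$output == "win")

# Write the sparse DTM plus target to a (hypothetical) svmlight file
write_svmlight(dtm_train, y = y, file = "dtm_train.svmlight")

# Read it back into H2O as a sparse frame
train <- h2o.uploadFile("dtm_train.svmlight", parse_type = "SVMLight")
```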