R: how to add numeric variables to a sparse matrix?

Question

Consider the following example:

library(text2vec)
library(glmnet)
library(dplyr)

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dataframe <- tibble(id = c(1, 2, 3, 4),
                    text = c("this is a test", "this is another", "hello", "what???"),
                    value = c(200, 400, 120, 300),
                    output = c("win", "lose", "win", "lose"))

> dataframe
# A tibble: 4 × 4
     id            text value output
  <dbl>           <chr> <dbl>  <chr>
1     1  this is a test   200    win
2     2 this is another   400   lose
3     3           hello   120    win
4     4         what???   300   lose

Now, I can use the excellent text2vec to get a sparse matrix corresponding to the text column. To do so, I simply need to follow the text2vec tutorial:

it_train = itoken(dataframe$text, 
                  ids = dataframe$id, 
                  progressbar = FALSE)

vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

> dtm_train
4 x 7 sparse Matrix of class "dgCMatrix"
  hello another what??? a is test this
1     .       .       . 1  1    1    1
2     .       1       . .  1    .    1
3     1       .       . .  .    .    .
4     .       .       1 .  .    .    .

This dtm sparse matrix can be fed into an ML model. But my problem is: how can I also use the value variable?

That is, as input predictors in, say, glmnet or xgboost, I want to use my sparse matrix (which comes from the text variable) together with my value variable, which contains some valuable information. How can I do that? Can I somehow add columns to the sparse matrix?

Thanks!

how do you want to add it? just like `cbind(dtm_train, value=dataframe$value)` or ... ?? — user20650, Jun 08 '17 at 00:45
i dont know what is the proper way. would that convert to a normal matrix? — ℕʘʘḆḽḘ, Jun 08 '17 at 00:47

score 0 · Answer 1 · answered Feb 26 '20 at 06:53

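You can use `scipy.sparse.hstack`. Note that this is Python (SciPy), so it applies when the document-term matrix is a SciPy sparse matrix and `dataframe` is a pandas DataFrame: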

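import numpy as np
from scipy.sparse import hstack

# Reshape the numeric column to (n_rows, 1) so it can be stacked as one extra column
value_col = np.array(dataframe['value'])[:, None]
# hstack keeps the result sparse; format="csr" gives a row-sliceable matrix
dtm_train = hstack((dtm_train, value_col), format="csr")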

Remember that you have to apply the same operation to your hold-out data!
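
Since the question is about R, here is a minimal sketch of the same column-stacking idea using the Matrix package, which provides the dgCMatrix class that text2vec returns. The small `dtm_train` built here is only a stand-in for the real document-term matrix, and the names `value_sparse` and `x_train` are made up for illustration. Wrapping the numeric column as a one-column sparse matrix before calling cbind keeps the result sparse:

```r
library(Matrix)

# Stand-in for the text2vec document-term matrix (4 documents x 3 terms).
dtm_train <- sparseMatrix(
  i = c(1, 1, 2, 3),
  j = c(1, 2, 2, 3),
  x = rep(1, 4),
  dims = c(4, 3),
  dimnames = list(as.character(1:4), c("this", "is", "hello"))
)

# The numeric predictor from the data frame (dataframe$value in the question).
value <- c(200, 400, 120, 300)

# Wrap the numeric column as a one-column sparse matrix (hypothetical name).
value_sparse <- Matrix(value, ncol = 1, sparse = TRUE,
                       dimnames = list(NULL, "value"))

# Bind it on as an extra column; both inputs are sparse, so the result is too.
x_train <- cbind(dtm_train, value_sparse)

print(class(x_train))
print(dim(x_train))
```

The combined `x_train` can then be passed as the predictor matrix to glmnet, which accepts sparse matrices from the Matrix package. The hold-out matrix needs the same extra column, appended in the same position.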

rushikesh maheshwari
