Consider the following example:
library(text2vec)
library(glmnet)
library(dplyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dataframe <- tibble(id = c(1, 2, 3, 4),
                    text = c("this is a test", "this is another", "hello", "what???"),
                    value = c(200, 400, 120, 300),
                    output = c("win", "lose", "win", "lose"))
> dataframe
# A tibble: 4 × 4
id text value output
<dbl> <chr> <dbl> <chr>
1 1 this is a test 200 win
2 2 this is another 400 lose
3 3 hello 120 win
4 4 what??? 300 lose
Now, I can use the excellent text2vec to get a sparse matrix corresponding to the text column. To do so, I simply need to follow the text2vec tutorial:
it_train = itoken(dataframe$text,
                  ids = dataframe$id,
                  progressbar = FALSE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
> dtm_train
4 x 7 sparse Matrix of class "dgCMatrix"
hello another what??? a is test this
1 . . . 1 1 1 1
2 . 1 . . 1 . 1
3 1 . . . . . .
4 . . 1 . . . .
This dtm sparse matrix can be fed into an ML model. But my problem is: how can I also use the value variable?
That is, as input predictors in, say, glmnet or xgboost, I want to use my sparse matrix (which comes from the text variable) but also my value variable, which contains some valuable information. How can I do that? Could we somehow add information to the sparse matrix?
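To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, not a confirmed solution. It assumes that appending a column with cbind() from the Matrix package keeps the result sparse. The small matrix `dtm` is a hand-built stand-in for `dtm_train`, and the names `value_col` and `X` are only illustrative.

```r
library(Matrix)

# Toy stand-in for dtm_train: a 4 x 3 sparse document-term matrix
dtm <- sparseMatrix(i = c(1, 1, 2, 3, 4),
                    j = c(1, 2, 1, 3, 2),
                    x = 1,
                    dims = c(4, 3),
                    dimnames = list(as.character(1:4),
                                    c("this", "is", "hello")))

# The extra numeric predictor, in the same row order as the documents
value <- c(200, 400, 120, 300)

# Wrap the numeric vector in a one-column sparse matrix (assumption:
# this is enough for cbind() to return a sparse result)
value_col <- Matrix(value, ncol = 1, sparse = TRUE,
                    dimnames = list(NULL, "value"))

# Append the column to the document-term matrix
X <- cbind(dtm, value_col)

dim(X)
class(X)
```

If this works as hoped, `X` could be passed wherever `dtm_train` would be. One thing I am unsure about is scale: `value` is in the hundreds while the term counts are 0 or 1, so some standardization may be needed depending on the model.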
Thanks!