How to include additional variables alongside the bag of words in a quanteda classifier
I am working on a classification task using quanteda in R, and I want my models to consider some variables in addition to the bag of words. For instance, I computed dictionary-based sentiment indexes and I'd like to include them as features.
These are the indexes I created for each document:
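# build the data frame directly so that cbind() does not coerce the numeric columns to character
# note: @x stores only the non-zero entries of a sparse dfm, so these columns line up
# with the documents only if each dfm has a single column with no zero counts
dfneg <- data.frame(label = docvars(negDfm1, "label"),
                    neg = negDfm1@x, pos = posDfm@x,
                    ang = angDfm@x, disg = disgDfm1@x)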
This is the document-feature matrix (DFM) I will work with:
newsdfm <- dfm(newscorp, tolower = TRUE, stem = FALSE, remove_punct = TRUE,
               remove = stopwords("english"), verbose = TRUE)
newst <- dfm_trim(newsdfm, min_docfreq = 2, verbose = TRUE)
id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)
# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)
# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train)
Finally, I run a classifier, for instance Naive Bayes or a lasso:
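library(glmnet)  # provides cv.glmnet()
# Naive Bayes on the training dfm, using the "label" docvar as the class
NBmodel <- textmodel_nb(train, docvars(train, "label"))
# lasso (alpha = 1) logistic regression with 10-fold cross-validation
lasso <- cv.glmnet(train, docvars(train, "label"),
                   family = "binomial", alpha = 1, nfolds = 10,
                   type.measure = "class")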
This is what I tried after creating the dfm, but it didn't work:
newsdfm@Dimnames$features$negz <- dfneg$neg
newsdfm@Dimnames$features$posz <- dfneg$pos
newsdfm@Dimnames$features$angz <- dfneg$ang
newsdfm@Dimnames$features$disgz <- dfneg$disg
Then I thought of creating document variables before creating newsdfm:
docvars(newscorp , "negz") <- dfneg$neg
docvars(newscorp , "posz") <- dfneg$pos
docvars(newscorp , "angz") <- dfneg$ang
docvars(newscorp , "disgz") <- dfneg$disg
But at that point, I don't know how to tell the classifier to consider these document variables in addition to the bag of words.
In summary, I expect the model to use both the document-feature matrix with all the words and the indexes I created for each document.
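To make the expected result concrete, here is a toy sketch of the shape of input I have in mind. It uses only the Matrix package rather than quanteda, and the objects bow and extra, along with their values, are made up for illustration. I don't know whether column-binding like this is the right way to do it with a real dfm.

```r
library(Matrix)

# toy stand-in for the document-feature matrix: 4 documents x 3 word features
bow <- Matrix(c(1, 0, 2,
                0, 3, 0,
                1, 1, 0,
                0, 0, 4),
              nrow = 4, ncol = 3, byrow = TRUE, sparse = TRUE,
              dimnames = list(paste0("doc", 1:4), c("w1", "w2", "w3")))

# toy stand-in for the per-document sentiment indexes (hypothetical values)
extra <- cbind(negz = c(2, 0, 1, 3),
               posz = c(0, 1, 1, 0))

# one row per document: the word columns followed by the index columns
combined <- cbind(bow, extra)
dim(combined)
```

What I would like is for the classifier to receive something like combined, with one row per document, instead of the word counts alone.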
Any suggestion is highly appreciated.
Thank you in advance,
Carlo