k-fold cross validation in quanteda

Question

I've been using the quanteda supervised machine learning workflow described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of a single fixed train/test split, I would like to use k-fold cross-validation. Could you point me towards the best way to integrate it into the workflow? Is there an easy way to do this in quanteda?

Many thanks

I tried to add cross-validation based on this example: https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html

require(quanteda)
require(quanteda.textmodels)
require(caret)


corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)

# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>% 
  tokens_remove(pattern = stopwords("en")) %>% 
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)

# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)

# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)

tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)

dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))

actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class

# confusionMatrix() is provided by caret, which is already loaded above
confusionMatrix(tab_class, mode = "everything", positive = "pos")

#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)

But it returns the following error: "Error in group.samples(Y) : argument "Y" is missing, with no default"

score 0 · Accepted Answer · answered Jan 16 '23 at 11:04

This should probably be a comment, but I cannot post comments yet. I think your problem is that you are calling crossval() from the wrong package. The link you shared documents crossval() from the GitHub package quanteda/quanteda.classifiers, whereas require(crossval) loads the separate crossval package. That package's crossval() has a different definition and presumably needs a different pipeline: it expects additional X and Y arguments, and their absence is what causes your error.
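If installing the GitHub package is not an option, k-fold cross-validation can also be written by hand with quanteda and quanteda.textmodels alone. The following is only a minimal sketch, assuming the same data_corpus_moviereviews data and Naive Bayes model as in the question; the names k, fold and acc are illustrative, and this is not a reproduction of what quanteda.classifiers::crossval() does internally.

```r
library(quanteda)
library(quanteda.textmodels)

# Build one dfm for the whole corpus (preprocessing kept minimal here)
corp <- data_corpus_moviereviews
dfmat <- dfm(tokens(corp, remove_punct = TRUE, remove_numbers = TRUE))

# Randomly assign every document to one of k folds of near-equal size
k <- 5
set.seed(300)
fold <- sample(rep(seq_len(k), length.out = ndoc(dfmat)))

# For each fold: train on the other k - 1 folds, test on the held-out fold
acc <- vapply(seq_len(k), function(i) {
  dfmat_train <- dfmat[fold != i, ]
  dfmat_test <- dfmat[fold == i, ]
  tmod <- textmodel_nb(dfmat_train, dfmat_train$sentiment)
  # Align test features with the features seen during training
  dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))
  pred <- predict(tmod, newdata = dfmat_matched)
  # Accuracy on the held-out fold
  mean(as.character(pred) == as.character(dfmat_test$sentiment))
}, numeric(1))

acc        # per-fold accuracy
mean(acc)  # cross-validated accuracy
```

Each document is used for testing exactly once, so the mean over folds is less dependent on one particular split than a single held-out sample. Other metrics (precision, recall, F1) could be collected per fold in the same loop.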

Riberiusz


Thanks a lot! Yes, I do want to use the crossval function from the quanteda package. How can I make sure the function is used from the correct package? – Max Overbeck Jan 16 '23 at 13:38
First, you must install it with `require(remotes); remotes::install_github("quanteda/quanteda.classifiers")`. Then, load it with `library(quanteda.classifiers)`. To be absolutely sure you are calling the right function, type `quanteda.classifiers::crossval()` instead of plain `crossval()`. – Riberiusz Jan 16 '23 at 14:28
