Intro
I am struggling with text classification of a big dataset of tweets and I would be thankful if someone could point me in the right direction.
The big picture is that I need to train a classifier that distinguishes between two classes on a huge dataset (up to 6 million texts). I've been preprocessing with the recipes framework and then fitting a glmnet lasso through tidymodels. The specific problem is that I run out of memory when calculating tf-idf.
Question
Which way should I direct my efforts to resolve this? One option is to do it more or less by hand: process the texts in batches to obtain the tf-idf values and then manually combine them into a sparse matrix object (a rough sketch of what I mean follows this paragraph). That sounds tedious, and surely someone has run into this problem before and solved it? Another option is Spark, but it is far beyond my abilities at the moment and is probably overkill for a one-time task. Or maybe I am missing something, and existing tools can handle this?
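To make that first option concrete, here is roughly what I have in mind. This is an untested sketch: quanteda, its dfm_tfidf(), and the batch size are my assumptions for illustration, not part of my current pipeline, and I would compute tf-idf only after combining the batches, since the idf part needs document counts over the whole corpus.
library(quanteda)  # assumed purely for this sketch
# Hypothetical helper: one batch of raw texts -> a sparse document-feature matrix
batch_dfm <- function(texts) {
  toks <- tokens(texts, remove_punct = TRUE)
  toks <- tokens_remove(toks, stopwords("en"))
  dfm(toks)
}
# Tokenize in chunks of 500k texts; only one batch of tokens exists at a time,
# and the per-batch results are sparse dfm objects
batches <- split(corpus$text, ceiling(seq_along(corpus$text) / 500000))
dfms <- lapply(batches, batch_dfm)
# rbind() on dfm objects stays sparse and aligns the feature columns
full_dfm <- do.call(rbind, dfms)
# tf-idf on the combined matrix, so the idf is computed over the full corpus
tfidf <- dfm_tfidf(full_dfm)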
Specifically, I am running into two kinds of problems when running the following (variables should be self-explanatory, but I will provide full reproducible code later):
recipe <-
  recipe(Class ~ text, data = corpus) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = m) %>%
  step_tfidf(text) %>%
  prep()
If corpus is too big or m is too large, RStudio crashes. If they are moderately large, it throws a warning:
In asMethod(object) :
sparse->dense coercion: allocating vector of size 1.2 GiB
I can't find anything about this online, and I don't understand it. Why is something being coerced from sparse to dense? That surely spells trouble for any large dataset. Am I doing something wrong? If this is preventable, maybe I will have better luck with my full dataset.
Or is there no hope for step_tfidf to cope with 6 million observations and no limit on max tokens?
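For what it's worth, the message itself seems to come from the Matrix package's sparse-to-dense coercion: as far as I can tell, coercing a large enough sparse matrix to a dense one produces the same warning. The sizes below are arbitrary, chosen only so that the dense allocation is well over a gibibyte.
library(Matrix)
# A 200,000 x 1,000 sparse matrix: its dense form needs ~1.5 GiB of doubles
m <- rsparsematrix(nrow = 200000, ncol = 1000, density = 0.001)
# Coercing to a dense base matrix should emit the same
# "sparse->dense coercion: allocating vector of size ... GiB" warning
d <- as.matrix(m)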
P.S. tm and tidytext can't even begin to approach the issue.
Full Code
I'll give a reproducible example of what I am trying to do. This code sets up a corpus of 5 million+ tweet-length texts built from random words:
library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)
# Use the words of Moby Dick as the vocabulary for the random "tweets"
url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()
# Draw ~6m text lengths, keep the positive draws, and round to whole word counts
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- round(x[x > 0])
# One row per text, plus an ID and a random binary Class
corpus <-
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>%
  bind_rows() %>%
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))
So corpus looks something like this:
> corpus
# A tibble: 5,402,638 × 3
text ID Class
<chr> <int> <fct>
1 included Fast at can aghast me some as article and ship things is 1 1
2 him to quantity while became man was childhood it that Who in on his the is 2 1
3 no There a pass are it in evangelical rather in direst the in a even reason to Yes and the this unconditional his clear other thou all… 3 0
4 this would against his You disappeared have summit the vagrant in fine inland is scrupulous signifies that come the the buoyed and of … 4 1
5 slippery the Judge ever life Moby But i will after sounding ship like p he Like 5 1
6 at can hope running 6 1
7 Jeroboam even there slow though thought though I flukes yarn swore called p oarsmen with sort who looked and sharks young Radney s 7 1
8 not if rocks ever lantern go last though at you white his that remains of primal Starbuck sans you steam up with against 8 1
9 Nostril as p full the furnish are nor made towards except bivouacks p blast how never now are here of difference it whalemen s much th… 9 1
10 and p multitudinously body Archive fifty was of Greenland 10 0
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows
The corpus itself takes up about 1 GB of RAM.
Below is the standard modeling workflow, which I present in full for completeness.
# prep
corpus_split <- initial_split(corpus, strata = Class)  # train/test split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train)  # k-fold CV prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix")  # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20)  # hyperparameter calibration
# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>%  # drop the 'twclid' token
  step_tokenfilter(text, max_tokens = 10000) %>%
  step_tfidf(text)
# lasso model
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%  # tuning the penalty hyperparameter
  set_mode("classification") %>%
  set_engine("glmnet")
# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)
# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)
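In case it matters for suggestions: my fallback, if I can't get the recipe to behave, would be to skip tidymodels for the preprocessing and hand a sparse matrix straight to glmnet, which accepts dgCMatrix input directly. This is again an untested sketch; tfidf here is the hypothetical sparse tf-idf matrix from the batching idea above, so both it and the coercion are assumptions.
library(glmnet)
library(Matrix)
# 'tfidf' is the hypothetical sparse tf-idf matrix from the batching sketch;
# coerce it to a plain dgCMatrix so glmnet recognizes it
x <- as(tfidf, "dgCMatrix")
y <- corpus$Class
# glmnet works on sparse x directly, so nothing gets densified;
# alpha = 1 is the lasso, and cv.glmnet tunes the penalty by cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# penalty chosen by CV, analogous to what tune_grid() does above
fit$lambda.min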