Intro

I am struggling with text classification of a big dataset of tweets and I would be thankful if someone could point me in the right direction.

The big picture is that I need to train a classifier that distinguishes between two classes on a huge dataset (up to 6 million texts). I've been building it with the recipes framework and then fitting a glmnet lasso through tidymodels. The specific problem is that I run out of memory when calculating tf-idf.

Question

Where should I direct my efforts to resolve this? I could do it essentially by hand in batches, computing all the tf-idf values and then manually combining them into a sparse matrix object, but that sounds tedious, and surely someone has run into this problem before and solved it. Another option is Spark, but it is far beyond my abilities at the moment and is probably overkill for a one-time task. Or maybe I am missing something, and existing tools are capable of this?

Specifically, I am running into two kinds of problems when running the following (variables should be self-explanatory, but I will provide full reproducible code later):

recipe <-
  recipe(Class ~ text, data = corpus) %>% 
  step_tokenize(text) %>%                      # split each text into word tokens
  step_stopwords(text) %>%                     # drop stop words
  step_tokenfilter(text, max_tokens = m) %>%   # keep only the m most frequent tokens
  step_tfidf(text) %>%                         # turn tokens into tf-idf features
  prep()

If corpus is too big or m is too large, RStudio crashes. If they are moderately large, it throws a warning:

In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.2 GiB

I'm not finding anything about it online, and I don't understand it. Why is it trying to coerce something from sparse to dense? That surely spells trouble for any large dataset. Am I doing something wrong? If this is preventable, maybe I will have better luck with my full dataset?
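For a sense of scale, here is a toy sketch (nothing to do with recipes, just the Matrix package) of what coercing a sparse matrix to a dense one costs:

library(Matrix)

m_sparse <- rsparsematrix(nrow = 5000, ncol = 5000, density = 0.001)
format(object.size(m_sparse), units = "MB")  # a fraction of a MB while sparse

m_dense <- as.matrix(m_sparse)               # 5000 * 5000 * 8 bytes ≈ 200 MB once dense
format(object.size(m_dense), units = "MB")

If something like this is happening inside the recipe, then at the scale of my full dataset the allocation would be far larger than the 1.2 GiB in the warning above.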

Or is there no hope for step_tfidf to cope with 6m observations and no limit on max tokens?

P.S. tm and tidytext can't even begin to handle data at this scale.

Full Code

I'll give a reproducible example of what I am trying to do. This code sets up a corpus of 5M+ tweet-length texts made of random words:

library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)

# use the words of Moby-Dick as a vocabulary
url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()

# draw a tweet-like length for each text
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- x[x > 0]

# one row per text: random words of that length, an ID, and a random binary Class
corpus <- 
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>% 
  bind_rows() %>% 
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))

So corpus looks something like this:

> corpus
# A tibble: 5,402,638 × 3
   text                                                                                                                                       ID Class
   <chr>                                                                                                                                   <int> <fct>
 1 included Fast at can aghast me some as article and ship things is                                                                           1 1    
 2 him to quantity while became man was childhood it that Who in on his the is                                                                 2 1    
 3 no There a pass are it in evangelical rather in direst the in a even reason to Yes and the this unconditional his clear other thou all…     3 0    
 4 this would against his You disappeared have summit the vagrant in fine inland is scrupulous signifies that come the the buoyed and of …     4 1    
 5 slippery the Judge ever life Moby But i will after sounding ship like p he Like                                                             5 1    
 6 at can hope running                                                                                                                         6 1    
 7 Jeroboam even there slow though thought though I flukes yarn swore called p oarsmen with sort who looked and sharks young Radney s          7 1    
 8 not if rocks ever lantern go last though at you white his that remains of primal Starbuck sans you steam up with against                    8 1    
 9 Nostril as p full the furnish are nor made towards except bivouacks p blast how never now are here of difference it whalemen s much th…     9 1    
10 and p multitudinously body Archive fifty was of Greenland                                                                                  10 0    
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows

The corpus itself takes up around 1 GB of RAM.

I follow the standard modeling workflow, which I present here in full for completeness.

# prep
corpus_split <- initial_split(corpus, strata = Class) # split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train) #k-fold cv prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix") # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20) # hyperparameter calibration

# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>% 
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>% 
  step_tokenfilter(text, max_tokens = 10000) %>% 
  step_tfidf(text)

# lasso model
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% # tuning the penalty hyperparameter
  set_mode("classification") %>%
  set_engine("glmnet")

# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)

2 Answers


Sadly, there isn't much you can do right now within tidymodels to solve your task. The {tidymodels} set of packages revolves around using {tibble}s as their common data vessel. This works great in many situations, except here for sparse data.

When a recipe is used in a workflow, it is required to hand off the data as a tibble to parsnip. This requires the data to be non-sparse, which in your case is going to explode the data size wildly! I.e., if you have 6,000,000 observations and just 2,000 different tokens, you are going to end up with 96 GB...
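To spell out that back-of-the-envelope figure (a dense numeric matrix in R stores one 8-byte double per cell):

> 6e6 * 2000 * 8 / 1e9
[1] 96

i.e. roughly 96 GB for the dense design matrix alone.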

This is something (I'm the author of {textrecipes} and one of the developers on the tidymodels team) I want to happen at some point, but it is currently outside of my control, as we first need to find a way to have sparse data in tibbles.

EmilHvitfeldt
  • Oh, this is very helpful, thanks for the clarification. I imagine not many models accept sparse matrices, so that makes it difficult to unify the approach, which is the goal of your amazing framework? Good luck on the project! – George B. Y. Jun 26 '23 at 17:48

In case anybody needs it, I'll summarize my findings.

There are two problems: (i) creating a tf-idf matrix requires a lot of memory, and (ii) tidymodels currently only accepts tibbles as incoming data, as kindly pointed out by EmilHvitfeldt. The solution is to generate the tf-idf dataset in a more memory-friendly way, sparsify it by the usual means, and then work directly with models that support sparse data.

The biggest trouble was that existing solutions for calculating tf-idf (I tried tm and tidytext) are memory inefficient. What I did was the following:

  1. The caveat is that I have enough memory to load all the texts into memory in the first place.
  2. Store the tokenized texts as an arrow dataset with no grouping and max_rows_per_file = 1000000 (this number can be tailored to your memory requirements); see the tokenization sketch after this list.
  3. Compute and store, as separate arrow datasets, the variables needed for calculating tf-idf: per-text word counts, text lengths, and the number of texts each word appears in.
  4. Loop through the files of one of the datasets, left-joining the data from the other two (each join happens in memory, but because each file contains only a portion of the total observations, it's not a problem).
  5. Save each result out as a parquet file within a new dataset.
  6. Open that dataset, collect, and tidytext::cast_sparse into a sparse matrix.
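The code below assumes corpus has already been tokenized into one row per (TextID, word) pair; a minimal, hypothetical sketch of that step with tidytext::unnest_tokens (plus the libraries the rest of the code relies on) could look like this:

library(arrow)
library(dplyr)
library(parallel)
library(stringr)
library(tidytext)

# hypothetical tokenization: one row per (TextID, word) pair, keeping Class
corpus <- corpus %>% 
  rename(TextID = ID) %>% 
  unnest_tokens(word, text)

The rest of the code: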
corpus %>% 
  write_dataset('tokenized_texts', max_rows_per_file = 1000000)

ds <- open_dataset('tokenized_texts')

# N is the total number of texts
N <- ds %>% 
  summarize(N = max(TextID)) %>% 
  collect() %>% 
  pull(N)

# this computes the number of times a word appears within a given text
ds.n <- 
  ds %>% 
  group_by(TextID, word) %>% 
  count() %>% 
  collect()

ds.n %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.n', max_rows_per_file = 1000000)
rm(ds.n)
gc()

# this computes the total number of words in each text
ds.total <- 
  ds %>%   
  group_by(TextID) %>% 
  count(name = 'TotalWords') %>% 
  collect()
ds.total %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.total', max_rows_per_file = 1000000)
rm(ds.total)
gc()

# this computes the number of texts in which each word appears (at least once)
ds.docs <- 
  ds %>% 
  group_by(TextID, word) %>% 
  summarize() %>% 
  group_by(word) %>% 
  count(name = 'Documents') %>% 
  collect()
ds.docs %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.docs', max_rows_per_file = 1000000)
rm(ds.docs)
gc()

# Load the prepared datasets
ds.n <- open_dataset('tokenized_arrow/ds.n')
ds.total <- open_dataset('tokenized_arrow/ds.total')
ds.docs <- open_dataset('tokenized_arrow/ds.docs')

# Loop through the files (mclapply is overkill here, this is a very fast step).
# Assumes the directory 'tokenized_arrow/final' exists.

files <- list.files('tokenized_arrow/ds.n', full.names = TRUE)
mclapply(files, mc.cores = parallel::detectCores() - 2, FUN = function(file) {
  outfile <- str_replace(file, 'ds\\.n', 'final')
  
  df <- read_parquet(file)
  ids <- unique(df$TextID)
  words <- unique(df$word)
  df %>% 
    left_join(
      ds.total %>% 
        filter(TextID %in% ids) %>% 
        collect()) %>% 
    left_join(
      ds.docs %>%
        filter(word %in% words) %>%
        collect()
    ) %>% 
    mutate(tf = n / TotalWords,
           idf = log(N / Documents),
           tf_idf = tf * idf) %>% 
    write_parquet(outfile)
  return(NULL)
}) %>% invisible()


# sparsify
m <- 
  open_dataset('tokenized_arrow/final/') %>% 
  collect() %>% 
  cast_sparse(TextID, word, tf_idf)
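From here the sparse matrix can be fed to a model that accepts sparse input without ever densifying it. As a final, hypothetical sketch (not something I ran above), assuming the Class labels can be looked up per TextID:

library(glmnet)

# hypothetical label lookup: one Class value per TextID, ordered like the rows of m
labels_tbl <- corpus %>% distinct(TextID, Class)
y <- labels_tbl$Class[match(as.integer(rownames(m)), labels_tbl$TextID)]

# glmnet accepts a dgCMatrix directly; alpha = 1 gives the lasso
fit <- cv.glmnet(m, y, family = "binomial", alpha = 1)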