
I am working with a large dataset of over 200 million rows, which I load with the vroom package to speed up processing. When I filter the dataset using an `%in%` condition, the result is missing observations. I am wondering whether there is a limit on how many rows dplyr will successfully filter. The dataset is too large to share in a reproducible example, but the code I use for the filter step is (roughly):

    library(tidyverse)
    library(vroom)

    # Raise vroom's connection buffer size before reading
    Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10)

    data <- vroom("data.csv", delim = ",")

    # Keep only rows whose ID appears in the vector `list`
    subset_data <- data %>%
      filter(ID %in% list)

Here, 'data.csv' contains 200 million observations, `ID` is a column in the `data` data frame, and `list` is a vector of ID numbers that fit the desired search criteria.
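
For illustration only, here is a tiny self-contained version of the same pattern with made-up data in place of data.csv (the object names `data` and `list` simply mirror the ones above):

    library(tidyverse)

    # Made-up stand-in for data.csv: 1,000 rows with an ID column
    data <- tibble(ID = 1:1000, value = rnorm(1000))

    # `list` mirrors the question's object: a plain vector of IDs to keep
    list <- c(5, 42, 300, 999)

    subset_data <- data %>%
      filter(ID %in% list)

    nrow(subset_data)  # 4, one row per matching ID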

I expect about 6 million rows to meet the criteria, but a little over 3 million are returned. I am wondering if there is a limit on the number of rows that filter can search: if, say, it could only search 100 million rows, that would explain why I am missing about half of the expected observations. Or does loading the data with vroom affect the number of rows I can successfully filter?

SolarSon
  • To get a better understanding you may try doing the same thing in base R and `data.table`: `subset_data <- subset(data, ID %in% list)` and `setDT(data)[ID %in% list]`, and see if the issue is `dplyr`-specific, R-specific, or there is some issue in the data itself. – Ronak Shah Oct 26 '21 at 03:30
  • Thanks @RonakShah for this suggestion. I've tried the two approaches you suggest and the exact same number of observations is returned (~3 million). That in effect answers my question, since it indicates that it isn't a dplyr filter limitation. I'll continue working with the data to understand whether there is an issue within the data itself. – SolarSon Oct 26 '21 at 03:43
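
Following Ronak Shah's suggestion above, a minimal sketch of the three-way comparison, assuming `data` and `list` are already defined as in the question (`as.data.table()` is used here instead of `setDT()` so that `data` itself is left untouched):

    library(dplyr)
    library(data.table)

    # dplyr
    n_dplyr <- data %>% filter(ID %in% list) %>% nrow()

    # base R
    n_base <- nrow(subset(data, ID %in% list))

    # data.table (on a copy, so `data` stays a tibble)
    n_dt <- as.data.table(data)[ID %in% list, .N]

    # If all three counts agree, the shortfall is not a dplyr/filter limitation
    c(dplyr = n_dplyr, base = n_base, data.table = n_dt)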

0 Answers