I am working with a large dataset of over 200 million rows, which I load with the vroom package to speed up processing. When I filter the dataset using an %in% condition, the result is missing observations, and I am wondering whether there is a limit on how many rows dplyr will successfully filter. The dataset is too large to share in a reproducible example, but the code I use for the filter is (roughly):
library(tidyverse)
library(vroom)
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10)
data <- vroom("data.csv", delim = ",")
subset_data <- data %>%
  filter(ID %in% list)
Here, the file 'data.csv' contains the 200 million observations, "ID" is a column in the "data" data frame, and "list" is a vector of ID numbers that match the desired search criteria.
I expect about 6 million rows to meet the criteria, but a little over 3 million are returned. Is there a limit on the number of rows that filter can search? For example, if it could only search 100 million rows, that would explain why I am missing about half of the expected observations. Or does loading the data with vroom affect the number of rows I can successfully filter?
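In case it is relevant, these are the kinds of checks I can run on my end. This is only a sketch, using the same data, ID, list, and subset_data objects from the code above:

# Confirm the full dataset actually loaded; vroom reads lazily, so force a
# row count and check for parsing problems reported by vroom/readr.
nrow(data)
problems(data)

# Compare the filter result against a direct count of matching rows.
sum(data$ID %in% list)
nrow(subset_data)

# Check for a type mismatch between the column and the lookup vector
# (e.g. character vs. double), which could make matches fail silently.
class(data$ID)
class(list)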