
I have a dataset with 3.9M rows and 5 columns, stored as a tibble. When I try to convert it to a tsibble, I run out of memory, even though I have 32 GB, which should be more than enough. The weird thing is that if I apply a filter() call before piping into as_tsibble(), the conversion works, even though I'm not actually filtering out any rows.

This does not work:

dataset %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))

This works, even though there are no "Phase" values less than 1, so the filter removes no rows at all:

dataset %>% filter(Phase > 0) %>% as_tsibble(index = TimeStamp, key = c("TSSU", "Phase"))

Any ideas why the second option works? Here's what the dataset looks like:

  Volume Travel_Time TSSU  Phase TimeStamp
   <dbl>       <dbl> <chr> <int> <dttm>
     105        1.23 01017     2 2020-09-28 10:00:00
      20        1.11 01017     2 2020-09-28 10:15:00
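
Here's a quick check (a sketch, assuming dplyr is attached and the data is already loaded) showing that the filter really is a no-op:

library(dplyr)

# Both counts come back the same, so the filter drops nothing
nrow(dataset)
nrow(dataset %>% filter(Phase > 0))

# No Phase values below 1
min(dataset$Phase)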
Phil

1 Answer


Have you tried the data.table package? It is optimized for performance with large datasets. I have replicated your steps, and depending on where the dataset variable is coming from, you may also want to use fread() to load the data, since it is very fast as well.

library(data.table)
# Convert the tibble to a data.table
dataset <- data.table(dataset)
# setkeyv(x = dataset, cols = c("TSSU", "Phase"))  # setting a key may not be needed
# Subset rows where Phase > 0 (same effect as the dplyr filter)
dataset[Phase > 0, ]
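
If the data starts out on disk (for example, as a CSV), here is a sketch of loading it with fread() instead; the file name below is hypothetical, since the question doesn't say where dataset comes from:

library(data.table)
library(tsibble)

# Hypothetical file name; adjust to wherever the data actually lives.
# Reading TSSU as character keeps leading zeros like "01017".
dataset <- fread("dataset.csv", colClasses = list(character = "TSSU"))

# TimeStamp may come in as character; make sure it is a date-time before indexing
dataset[, TimeStamp := as.POSIXct(TimeStamp)]

as_tsibble(dataset, index = TimeStamp, key = c("TSSU", "Phase"))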
Mark Derry