How to store where a passenger gets on and off a train whilst minimising size of file for plotting?

Question

I have 500GB of .csv data which includes these three (and other) variables: 1. where a passenger gets on a train, 2. where they get off and 3. the time the journey takes.

I need to make box plots of the journey time, grouped by where passengers got on and where they got off, in an interactive R Shiny app - this part is straightforward. But first I need to reduce the size of the file, as reading 500GB in a Shiny app is prohibitive. Is there a way to store these variables that makes this possible?

Even with vroom it takes too long to read, and I don't think {disk.frame} would work either. Any thoughts?

What question are you trying to answer? It seems to me you could take a much smaller random sample of the 500GB data set to statistically impute what the aggregate OD profile looks like. — SteveM, Jan 12 '21 at 12:57
I think you're right. I'm trying to look at the journey time between stops based on some other variables (e.g. % of trains running, time of day) which are also included next to the passenger in the file. So each row is a passenger which includes their boarding and alighting station, train stuff, time of day, etc. — HCAI, Jan 12 '21 at 13:10
Just a raw idea: You could divide the csv file into many files based on the combination of boarding station and where they get off, and then process/summarise each file separately down to the bare minimum needed for the app? — s_baldur, Jan 12 '21 at 15:16
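As a rough illustration of the "summarise down to the bare minimum" idea: a box plot only needs five numbers per boarding/alighting pair, and base R's `bxp()` can draw from those numbers without the raw rows. The sketch below uses base R only. The toy data and the column names `board`, `alight` and `minutes` are assumptions for illustration, not taken from the actual files, and `summarise_od()` is a hypothetical helper.

```r
# Hypothetical toy data standing in for the rows of one station pair file.
# Column names (board, alight, minutes) are assumptions.
set.seed(1)
trips <- data.frame(
  board   = sample(c("A", "B"), 1000, replace = TRUE),
  alight  = sample(c("C", "D"), 1000, replace = TRUE),
  minutes = rexp(1000, rate = 1 / 20)
)

# Reduce each boarding/alighting pair to the summaries a box plot needs.
# Outliers beyond the whiskers are deliberately not stored here, to keep
# the result small; out and group are left empty.
summarise_od <- function(df) {
  groups <- split(df$minutes, list(df$board, df$alight), drop = TRUE)
  bs <- lapply(groups, boxplot.stats)
  list(
    stats = vapply(bs, function(b) b$stats, numeric(5)),  # 5 x n_pairs matrix
    n     = vapply(bs, function(b) b$n, numeric(1)),
    conf  = vapply(bs, function(b) b$conf, numeric(2)),
    out   = numeric(0),
    group = numeric(0),
    names = names(groups)
  )
}

od <- summarise_od(trips)

# The app can draw from the stored summaries alone.
bxp(od, ylab = "Journey time (minutes)")
```

One caveat: quantiles from separate chunks cannot be merged exactly, so each pair's rows need to be summarised together. That is one reason to split the file by station pair first. Extra grouping variables such as time of day could be added to the `split()` call in the same way.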
Maybe read it once and save it in `fst` format, and then use fst — jangorecki, Jan 12 '21 at 16:02
Thank you all for your suggestions. I will try them all, and I'm now a big fan of fst. Is there a way of still being able to query the data without having to repeat factors in columns? — HCAI, Jan 13 '21 at 07:02
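A minimal sketch of the fst suggestion, assuming the fst package is installed. The toy data and column names are assumptions for illustration. If "repeating factors" means storing station names as text on every row, one possible reading is that keeping them as R factors already avoids this: a factor holds integer codes plus a single vector of levels.

```r
library(fst)  # assumes install.packages("fst") has been run

# Hypothetical toy data; column names are assumptions.
trips <- data.frame(
  board   = factor(c("A", "A", "B", "B")),
  alight  = factor(c("C", "D", "C", "D")),
  minutes = c(12.5, 30, 18, 22.5),
  hour    = c(8L, 9L, 17L, 18L)
)

path <- tempfile(fileext = ".fst")

# One-off conversion. compress runs from 0 to 100; higher values give
# smaller files at the cost of slower writes.
write_fst(trips, path, compress = 75)

# In the app: read only the columns the plot needs.
od <- read_fst(path, columns = c("board", "alight", "minutes"))

# A row range can also be read without loading the whole file.
first_two <- read_fst(path, columns = "minutes", from = 1, to = 2)

# Station names come back as factors: integer codes plus one levels vector.
levels(od$board)
as.integer(od$board)
```

For the full 500GB the conversion would have to be done in pieces that fit in memory, for example one file per station pair.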

0 Answers