I have 500 GB of .csv data that includes these three (and other) variables: 1. where a passenger gets on a train, 2. where they get off, and 3. the time the journey takes.

I need to make box plots of the journey time, grouped by where passengers got on and where they got off, in an interactive R Shiny app. That part is straightforward. But first I need to reduce the size of the data, because reading 500 GB in a Shiny app is prohibitive. Is there a way to store these variables that makes this possible?

Even with vroom, reading the data takes too long, and I don't think {disk.frame} would work either. Any thoughts?
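A minimal sketch of one way the data might be reduced, assuming the box plots only need the summary statistics for each boarding/alighting pair: compute those statistics once, offline, and let the app load only the small summary object. The column names `board`, `alight` and `time_min` and the toy data are hypothetical stand-ins, not the real file layout.

```r
# Toy stand-in for the real data; column names are hypothetical.
set.seed(1)
trips <- data.frame(
  board    = sample(c("A", "B"), 1000, replace = TRUE),
  alight   = sample(c("C", "D"), 1000, replace = TRUE),
  time_min = rexp(1000, rate = 1 / 20)
)

# One vector of journey times per origin-destination (OD) pair.
groups <- split(trips$time_min, list(trips$board, trips$alight), drop = TRUE)

# With plot = FALSE, boxplot() returns only the summary statistics
# (whisker ends, hinges, median, outliers), not the raw observations.
od_stats <- boxplot(groups, plot = FALSE)

# Store the small summary; the Shiny app would load this instead of the CSV.
stats_file <- tempfile(fileext = ".rds")
saveRDS(od_stats, stats_file)

# In the app: redraw the box plots from the stored summary.
bxp(readRDS(stats_file), xlab = "Boarding.Alighting", ylab = "Journey time (min)")
```

If the app also has to filter on other variables (time of day, for example), the summary would need to be computed per combination of those variables, which this sketch does not attempt.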

HCAI
  • What question are you trying to answer? It seems to me you could take a much smaller random sample of the 500 GB data set to statistically estimate what the aggregate origin-destination (OD) profile looks like. – SteveM Jan 12 '21 at 12:57
  • I think you're right. I'm trying to look at the journey time between stops based on some other variables (e.g. % of trains running, time of day) which are also included next to the passenger in the file. So each row is a passenger, with their boarding and alighting station, train information, time of day, etc. – HCAI Jan 12 '21 at 13:10
  • Just a raw idea: you could split the csv file into many files, one per combination of boarding station and alighting station, and then process/summarise each file separately down to the bare minimum needed for the app? – s_baldur Jan 12 '21 at 15:16
  • Maybe read it once, save it in `fst` format, and then use fst. – jangorecki Jan 12 '21 at 16:02
  • Thank you all for your suggestions. I will try all of them, and now I'm a big fan of fst. Is there a way of still being able to query the data without having to repeat factors in columns? – HCAI Jan 13 '21 at 07:02
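The splitting idea from the comments could be sketched roughly as follows: read the CSV in chunks through an open connection, keep only the three needed columns, and append each OD pair's journey times to its own small file. This uses base R only so it runs without extra packages; the file layout, column names (`board`, `alight`, `time_min`, `other_var`) and chunk size are assumptions for illustration, and station names are assumed to be safe to use in file names.

```r
# Build a small CSV to stand in for the 500 GB file (all names hypothetical).
set.seed(42)
n <- 10000
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(
  board     = sample(c("A", "B", "C"), n, replace = TRUE),
  alight    = sample(c("X", "Y"), n, replace = TRUE),
  time_min  = round(runif(n, 5, 60), 1),
  other_var = rnorm(n)                      # a column the app does not need
), csv_path, row.names = FALSE)

out_dir <- tempfile("od_")
dir.create(out_dir)
chunk_rows <- 2500                          # tune to available memory

con <- file(csv_path, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

# "NULL" in colClasses makes read.csv skip that column entirely.
col_classes <- rep("NULL", length(header))
col_classes[header %in% c("board", "alight", "time_min")] <- NA

repeat {
  # An open connection is read from its current position, so each call
  # picks up where the previous chunk ended.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_rows,
             col.names = header, colClasses = col_classes),
    error = function(e) NULL                # treat a failed read at EOF as "done"
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # Append each OD pair's journey times to its own file.
  pieces <- split(chunk$time_min, list(chunk$board, chunk$alight), drop = TRUE)
  for (od in names(pieces)) {
    write.table(data.frame(t = pieces[[od]]),
                file.path(out_dir, paste0(od, ".txt")),
                append = TRUE, row.names = FALSE, col.names = FALSE)
  }
  if (nrow(chunk) < chunk_rows) break
}
close(con)

# Each per-pair file is now small enough to summarise on its own.
od_files <- list.files(out_dir, full.names = TRUE)
od_times <- lapply(od_files, scan, quiet = TRUE)
names(od_times) <- sub("\\.txt$", "", basename(od_files))
od_summary <- sapply(od_times, fivenum)     # min, lower hinge, median, upper hinge, max
```

The per-pair text files are only a placeholder for whatever compact format is preferred; with the fst package, `write_fst()` and `read_fst(path, columns = ...)` could play the same role, at the cost of an extra dependency.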

0 Answers