
I have a 120GB csv file which is a set of numerical values grouped by categorical variables.

e.g.

df <- data.frame(x = c(rep("BLO", 100), rep("LR", 100)), y = runif(200))

I would like to calculate some summary statistics using group_by(x), but the file doesn't fit into memory. What are my options? I've looked at tidyfst and {disk.frame}, but I'm not sure which (if either) is the right tool. Any help would be much appreciated.
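
To make it concrete, something along these lines is roughly what I was imagining with {disk.frame}. This is only a sketch based on my reading of the package documentation, untested on the real data; "big_file.csv", the output directory, the chunk size, and the column names x/y are placeholders from my example above.

```r
library(disk.frame)
library(dplyr)

# use a few parallel workers; each chunk is processed separately
setup_disk.frame(workers = 4)

# convert the large csv into a chunked disk.frame stored on disk
# (in_chunk_size controls how many rows are read per chunk)
big.df <- csv_to_disk.frame("big_file.csv",
                            outdir = "big_file.df",
                            in_chunk_size = 1e6)

# group_by/summarise run chunk-wise; collect() only brings back the small result
res <- big.df %>%
  group_by(x) %>%
  summarise(mean_y = mean(y)) %>%
  collect()
```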

Thank you.

HCAI
  • Do you have a csv file or an fst file? – akrun Jan 15 '21 at 20:40
  • Yes but as it turns out I also have an fst file for good measure. – HCAI Jan 15 '21 at 20:41
  • tidyfst allows [grouping by](https://github.com/hope-data-science/tidyfst), but from my experience, as soon as the data is loaded, it becomes slow. Another option would be Microsoft R with RevoScaleR grouping – akrun Jan 15 '21 at 20:41
  • I'm not sure if you've already come across this, but https://bookdown.org/rdpeng/RProgDA/working-with-large-datasets.html#out-of-memory-strategies – latlio Jan 15 '21 at 20:42
  • @akrun Yes, I was trying out tidyfst last night, but I thought it was actually loading all the data into memory. So instead what I was doing was `tidyfst::parse_fst("fstFile.fst") -> ft`, but that won't let me use group_by. Am I missing something? I'll look at RevoScaleR. – HCAI Jan 15 '21 at 20:46
  • @HCAI `parse_fst` is very fast, but that is all it does. Once you load the data, it becomes slow. If you have many columns and are only using a few, then load only those columns, with an index to keep track – akrun Jan 15 '21 at 20:47
  • I have used RevoScaleR in the past with a file size similar to yours on a server, and it is fast as well – akrun Jan 15 '21 at 20:54
  • Ah, that makes sense about tidyfst. The query took so long to run that my session kept timing out; I thought it was reading in the data, as the documentation isn't quite clear. I'm on a Mac, so it looks like RevoScaleR isn't available. – HCAI Jan 15 '21 at 22:38
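
Edit (after the comments): here is my understanding of akrun's suggestion to load only the needed columns with tidyfst. This is just a sketch based on the package documentation and is untested on the full file; `fstFile.fst` and the column names x/y are placeholders from my example above.

```r
library(tidyfst)

# parse_fst() only reads the metadata of the .fst file, so it is fast
ft <- parse_fst("fstFile.fst")

# pull just the columns needed for the summary into memory as a data.table
dt <- ft %>% select_fst(x, y)

# the grouping/summary can then be done in memory at data.table speed
res <- dt %>% summarise_dt(mean_y = mean(y), by = x)
```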

0 Answers