
I have a 120GB csv file which is a set of numerical values grouped by categorical variables.

e.g.

df <- data.frame(x = c(rep("BLO", 100), rep("LR", 100)), y = runif(200))

I would like to calculate some summary statistics using group_by(x), but the file doesn't fit into memory. What are my options? I've looked at tidyfst and {disk.frame}, but I'm not sure which (if either) is the right tool. Any help would be much appreciated.
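
To make it concrete, something along these lines is roughly what I was imagining with {disk.frame}. This is only a sketch based on my reading of the package documentation, untested on the real data; "big_file.csv", the output directory, the chunk size, and the column names x/y are placeholders from my example above.

```r
library(disk.frame)
library(dplyr)

# use a few parallel workers; each chunk is processed separately
setup_disk.frame(workers = 4)

# convert the large csv into a chunked disk.frame stored on disk
# (in_chunk_size controls how many rows are read per chunk)
big.df <- csv_to_disk.frame("big_file.csv",
                            outdir = "big_file.df",
                            in_chunk_size = 1e6)

# group_by/summarise run chunk-wise; collect() only brings back the small result
res <- big.df %>%
  group_by(x) %>%
  summarise(mean_y = mean(y)) %>%
  collect()
```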

Thank you.

HCAI
  • Do you have a csv file or an fst file? – akrun Jan 15 '21 at 20:40
  • Yes but as it turns out I also have an fst file for good measure. – HCAI Jan 15 '21 at 20:41
  • tidyfst allows [grouping by](https://github.com/hope-data-science/tidyfst), but from my experience, as soon as the data is loaded, it becomes slow. Another option would be Microsoft R with RevoScaleR grouping – akrun Jan 15 '21 at 20:41
  • I'm not sure if you've already come across this, but https://bookdown.org/rdpeng/RProgDA/working-with-large-datasets.html#out-of-memory-strategies – latlio Jan 15 '21 at 20:42
  • @akrun Yes, I was trying out tidyfst last night, but I thought it was actually loading all the data into memory. So instead what I was doing was `tidyfst::parse_fst("fstFile.fst") -> ft`, but that won't let me use group_by. Am I missing something? I'll look at RevoScaleR. – HCAI Jan 15 '21 at 20:46
  • @HCAI `parse_fst` is very fast, but that is all it does. Once you load the data, it becomes slow. If you have many columns and are only using a few, then load only those columns, with an index to keep track – akrun Jan 15 '21 at 20:47
  • I have used RevoScaleR in the past with a file size similar to yours on a server, and it is fast as well – akrun Jan 15 '21 at 20:54
  • Ah, that makes sense about tidyfst. The query took so long to run that my session kept timing out; I thought it was reading in the data, as the documentation isn't quite clear. I'm on a Mac, so it looks like RevoScaleR isn't available. – HCAI Jan 15 '21 at 22:38
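
Edit (after the comments): here is my understanding of akrun's suggestion to load only the needed columns with tidyfst. This is just a sketch based on the package documentation and is untested on the full file; `fstFile.fst` and the column names x/y are placeholders from my example above.

```r
library(tidyfst)

# parse_fst() only reads the metadata of the .fst file, so it is fast
ft <- parse_fst("fstFile.fst")

# pull just the columns needed for the summary into memory as a data.table
dt <- ft %>% select_fst(x, y)

# the grouping/summary can then be done in memory at data.table speed
res <- dt %>% summarise_dt(mean_y = mean(y), by = x)
```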

0 Answers