
I am trying to import a large SAS dataset (75.5 million rows, sas7bdat) into RStudio in a way that lets me work with the whole dataset. After some digging and talking to some people, it sounds like what I want is to read the file without loading it all into memory, which is why I was attempting to use chunked::read_csv_chunkwise, as this post suggests:

https://mnoorfawi.github.io/handling-big-files-with-R/

I used SAS to export the dataset as a csv file, v8_mc.csv. Then in R:

library(chunked)
library(dplyr)
## Here we don't read the file, we just get something like a pointer to it.
data_chunked <- read_csv_chunkwise("B:/SAS_DATASETS/v8_mc.csv",
                                   skip = 1, stringsAsFactors = FALSE,
                                   header = TRUE, sep = ",")

But I get the following warning:

In FUN(X[[i]], ...) : Unsupported type 'logical'; using default type 'string'

The documentation said that head() should work with the chunked object, so I said what the heck and tried:

> head(data_chunked)
Error in .local(x, ...) : 
  Conversion to int failed; line=957; column=52; string='V061'
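
Since both messages look like type-detection problems (column 52 gets guessed as an integer but contains strings like 'V061'), I'm wondering whether forcing every column to character while chunking would sidestep them. Something like this untested sketch with readr::read_csv_chunked (the chunk size is a placeholder and the per-chunk work is just a stand-in):

library(readr)

## Untested: read the csv in fixed-size chunks, force every column to
## character so nothing is mis-guessed as integer/logical, and keep only a
## small per-chunk summary instead of all 75.5 million rows.
row_counts <- read_csv_chunked(
  "B:/SAS_DATASETS/v8_mc.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    ## replace with the real per-chunk work (filter, summarise, ...)
    data.frame(start_row = pos, n_rows = nrow(chunk))
  }),
  chunk_size = 1e6,
  col_types = cols(.default = col_character())
)

Is that a reasonable direction, or does it defeat the point of chunked?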

I've never used SAS before and I'm a total newbie to big data in R. Since I can't open the SAS or CSV file in R, I can't figure out how to make a reproducible example. I'd welcome help making this a better question.

Thanks!

abra
  • Have you tried `fread` from package `data.table`? – Sweepy Dodo Jan 31 '22 at 15:09
  • @SweepyDodo Yes, I tried `mc <- data.table::fread("B:/SAS_DATASETS/v8_mc.csv")` and got `Error: cannot allocate vector of size 657.5 Mb` – abra Jan 31 '22 at 15:23
  • Try `readr::read_csv` or `vroom::vroom`. – G. Grothendieck Jan 31 '22 at 15:32
  • @G.Grothendieck I tried `vroom::vroom("F:/APCD new PHA only/SAS_DATASETS/v8_medicalClaims.csv", .name_repair = ~ janitor::make_clean_names(., case = "lower_camel"))` and it indexed 2.15GB in 8 min (51.03MB/s), but the file is 50GB. Is that rate reasonable? Will R be able to work with that without chunking it or using sqlite? (I've put an untested vroom sketch with forced column types after these comments.) – abra Jan 31 '22 at 16:26
  • @abra My concern is that even if you used a [batch query](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html), the same problem will persist. The problem is your local machine (pc/mac) not having enough memory. I recently wanted to feed a large csv into R to no avail, with the same error `cannot allocate vector of size...`. The csv was about 4 GB zipped (I can't remember the exact size unzipped; easily double, and larger still once loaded into R). This memory limit isn't exclusive to R; the same happens with Pandas/Python. We are limited by local RAM – Sweepy Dodo Jan 31 '22 at 17:37
  • I have also tried packages `ff` and `bigmemory`. However, they did not offer the flexibility of data frames (or `data.table`) as they are matrix based. Furthermore, they actually managed to import less data than `data.table` before running out of memory. I suppose this is when cloud computing is necessary. From my experience of an R server (virtual machine), 100 m rows is more than manageable (though I believe it also depends on the server/plan one is paying for). Nonetheless, I hope to hear from others for solutions – Sweepy Dodo Jan 31 '22 at 17:42
  • p.s. my machine (laptop) has 16GB RAM (not great). The 4GB zipped csv I mentioned had about 214 million rows – Sweepy Dodo Jan 31 '22 at 17:47
  • @SweepyDodo I'm actually connected via VPN and Remote Desktop Connection to a server that supposedly has 128GB of usable RAM. I think what I want is to read the file without loading the whole thing into memory, which is why I was attempting to use `chunked`, as this post suggests: https://mnoorfawi.github.io/handling-big-files-with-R/ (I've also added a rough chunk-into-SQLite sketch below.) – abra Jan 31 '22 at 18:34
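
Here is the untested vroom sketch I mentioned in my comment above, forcing everything to character and selecting only a few columns (assuming vroom accepts a readr-style cols() spec; the column names are just placeholders):

library(vroom)

## Untested: vroom indexes the file and reads it lazily, so forcing character
## types avoids the bad type guesses, and only the selected columns should be
## materialised. The column names below are placeholders.
mc <- vroom(
  "B:/SAS_DATASETS/v8_mc.csv",
  col_types = cols(.default = "c"),
  col_select = c(memberId, claimDate, paidAmount)  # hypothetical columns
)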
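
And a rough, untested sketch of the "load it into SQLite in chunks, then query from disk" idea from the comments, in case that turns out to be the better route (the table name and the example query column are placeholders):

library(DBI)
library(RSQLite)
library(readr)

## Untested: stream the csv into an on-disk SQLite table chunk by chunk,
## then query it lazily with dplyr/dbplyr so only results come into RAM.
con <- dbConnect(SQLite(), "B:/SAS_DATASETS/v8_mc.sqlite")

read_csv_chunked(
  "B:/SAS_DATASETS/v8_mc.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(con, "v8_mc", as.data.frame(chunk), append = TRUE)
  }),
  chunk_size = 1e6,
  col_types = cols(.default = col_character())
)

## then e.g. dplyr::tbl(con, "v8_mc") %>% dplyr::count(someColumn)  # placeholder column
dbDisconnect(con)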

0 Answers