
I want to read a .sav file into R. However, it is much too large (>11 GB) to load into memory at once. If I could read in only portions of the data, that would be fine, though not ideal. So, is there a way to do any of the following:

  • Read just the header (for column names)
  • Import only certain columns (rather than the entire dataset) - I've tried the functions from haven but can't seem to get the col_select argument to work.
  • Read in the entire dataset in a memory-efficient way (e.g. in chunks) - I'm aware of big-data tools for .csv files but not for .sav files.

Thanks for your help!

Trent
  • There are many previous posts on reading .sav files. Have you tried those suggestions? The file size is probably a barrier so your best solution might be to open the file in SPSS (or have someone with SPSS do it for you), extract the variables you want, and then save that much smaller file. – dcarlson Aug 27 '20 at 22:01
  • I've found many posts on reading .sav files. I've also found many posts on reading large data files (e.g. .csv). But I have yet to find a post detailing how to read a large .sav file. Currently, the popular big data tools in R (e.g. ff, bigmemory, sparklyr) do not have the capability of reading .sav files. – Trent Aug 28 '20 at 04:24
  • It appears that package `memisc` can do this, but I haven't tried it. Here's a [summary](https://www.elff.eu/software/memisc/import/). The example reads the first variable from a file with many variables. – dcarlson Aug 28 '20 at 13:35

1 Answer


As far as getting the whole dataset into R goes, I don't think reading it in chunks (or any similar workaround) would make loading the entire dataset any more memory efficient. But there are easy, more memory-efficient ways to get just the column names or specific variables:

Getting column names can be done using the n_max argument to read in an empty data frame (which still carries the variable and value labels):

# using n_max = 0 is much more memory efficient
bench::mark(read_sav(temp)[0,],
            read_sav(temp, n_max = 0))[1:9]
# A tibble: 2 x 9
  expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 read_sav(temp)[0, ]          1.11s    1.11s     0.902    76.5MB     6.31     1     7      1.11s
2 read_sav(temp, n_max = 0)   5.86ms   6.13ms   154.       97.2KB     1.98    78     1    505.6ms
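In practice that looks something like the sketch below ("large.sav" is a placeholder path, not a file from the question):

```r
library(haven)

# n_max = 0 reads zero rows, so only the header is materialized:
# column names plus variable/value label attributes
header <- read_sav("large.sav", n_max = 0)  # placeholder path
col_names <- names(header)
```

You can then inspect `col_names` to decide which variables to pull in with col_select.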

Getting specific columns can be done with select-helpers (or indices, or names, etc.) and is more memory efficient:

bench::mark(read_sav(temp)[c("V1", "V5")],
            read_sav(temp, col_select = matches("V(1|5)$")))[1:9]
# A tibble: 2 x 9
  expression                                           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 read_sav(temp)[c("V1", "V5")]                      1.06s    1.06s     0.939    76.5MB     5.64     1     6      1.06s
2 read_sav(temp, col_select = matches("V(1|5)$")) 186.45ms 187.89ms     5.32    528.7KB     0        3     0    563.5ms
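If even the selected columns are too big to hold at once, a rough sketch is to combine col_select with skip and n_max to walk the file in row chunks. This assumes your haven version supports the skip argument of read_sav(); note that each chunk still re-opens and scans the file, so this trades speed for memory:

```r
library(haven)

# Sketch: process two selected columns in row chunks of 1e5.
# "large.sav" and the column names are placeholders.
chunk_size <- 1e5
offset <- 0
repeat {
  chunk <- read_sav("large.sav",
                    col_select = c("V1", "V5"),
                    skip = offset, n_max = chunk_size)
  if (nrow(chunk) == 0) break
  # ... process `chunk` here (aggregate, filter, append to disk, etc.) ...
  offset <- offset + chunk_size
}
```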

Data set-up:

library(haven)  # provides read_sav()/write_sav()

test <- as.data.frame(matrix(1:1e7, nrow = 1e4, ncol = 1e3))
temp <- tempfile()
write_sav(test, temp)

# file.remove(temp)
Andrew