I have a data set with 57,000 rows and 5,500 columns, containing both numeric and character variables. I originally downloaded the data in .dta format, and Stata reads it quickly: 0.13 seconds when I time it with the timer command.

I have since been using R, which, from what I have read, is supposed to be much more efficient. I exported my data from Stata to CSV, but even after following the recommendations I found on Stack Exchange, the results are not convincing.

Here is the best solution I came across:

library(data.table)
system.time(
  fread("~/Data/GSS/GSS.csv", stringsAsFactors = FALSE, header = TRUE,
        na.strings = paste0(".", letters), data.table = FALSE)
)

I get:

Read 57061 rows and 5548 (of 5548) columns from 1.053 GB file in 00:00:46
  user  system elapsed 
  52.000   1.492  53.470 

I also get a lot of warnings about the missing values, even though I have declared them. The warnings look like this:

Bumped column XXXX to type character on data row XXXX, field contains '.n'. 

I think it has to do with R not recognizing these missing values in numeric columns.
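
To make the suspected cause concrete, here is a minimal base R sketch. It assumes a made-up three-row CSV (the column names `id`, `age` and `region` are hypothetical, not taken from the GSS file) and uses `read.csv` rather than `fread`, so it only illustrates the intent of `na.strings`; how `fread` treats these codes in numeric columns may depend on the data.table version.

```r
# Hypothetical three-row CSV using Stata-style extended missing codes (.n, .d)
csv_text <- "id,age,region
1,34,north
2,.n,south
3,.d,west"

# ".", ".a", ..., ".z": the codes Stata writes for missing numeric values
stata_missing <- c(".", paste0(".", letters))

# Without na.strings, the age column is read as text because of ".n" and ".d"
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(raw$age)

# Declaring the codes lets the column stay numeric, with NA in place of the codes
clean <- read.csv(text = csv_text, na.strings = stata_missing,
                  stringsAsFactors = FALSE)
class(clean$age)
sum(is.na(clean$age))
```

If a column still comes back as character after declaring the codes, that points to the reader not applying `na.strings` to numeric fields, which matches the "Bumped column" warnings above.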

Any suggestions on how to improve this? As a side note, I tried sqldf, but it did not work on my computer, even after upgrading the package to the most recent version.

Here is the data I am working with: http://www3.norc.org/GSS+Website/Download/

Doon_Bogan
  • Stata can read the `.dta` faster because that's a _binary_ format. It's almost always more efficient to read a binary file than a text file. An appropriate comparison would be against how quickly Stata reads the csv. Have you tried to read the `.dta` file in R with `foreign::read.dta`? – Joshua Ulrich Oct 30 '14 at 16:03
  • `read.dta` is actually slower than `fread` using a csv! – Doon_Bogan Oct 30 '14 at 16:25
  • But you said `fread` left many of the columns as character, likely because of Stata's multiple missing value types. Does `read.dta` handle them like you'd expect? `fast + not_what_you_want < slow + correct` – Joshua Ulrich Oct 30 '14 at 16:34
  • `read.dta` does handle the NAs better... – Doon_Bogan Oct 30 '14 at 16:41
  • Then just read it in once with `read.dta`, then save it to a `.Rdata` or `.rds` file and be done with it. :) – Joshua Ulrich Oct 30 '14 at 16:49
  • So much faster now. Thanks for the clarification, I've learnt something today – Doon_Bogan Oct 31 '14 at 08:53
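
The read-once-then-cache workflow suggested in the comments could look roughly like the sketch below. It is only an illustration: a small temporary `.dta` file written with `foreign::write.dta` stands in for the real GSS file, and the names `dta_path`, `rds_path` and `gss` are placeholders.

```r
library(foreign)  # ships with standard R distributions

# Stand-in for the real file: write a small data frame to a temporary .dta
dta_path <- tempfile(fileext = ".dta")
rds_path <- tempfile(fileext = ".rds")
write.dta(data.frame(id = 1:3, score = c(2.5, NA, 4.0)), dta_path)

# One-time, slow step: parse the Stata file
gss <- read.dta(dta_path)

# Cache the result in R's native binary format
saveRDS(gss, rds_path)

# Every later session: load the cached copy instead of re-parsing
gss_cached <- readRDS(rds_path)
```

With the real data, wrapping the `read.dta` and `readRDS` calls in `system.time()` shows how much the cached copy saves on each subsequent load.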

1 Answer

I find read.dta13 from the readstata13 package to be the best option for reading Stata files in general.

The main advantage of read.dta13 is that it reads Stata's labelled data correctly into factors and keeps the level order, which I find extremely important. It also reads files from any version of Stata, including Stata 15.

I could not do this with the haven or foreign packages.
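
As a rough sketch of what this looks like in practice, the example below round-trips a tiny data frame through a temporary `.dta` file. It assumes the readstata13 package is installed; the data frame, the `health` variable and its levels are invented for illustration and are not from the GSS data.

```r
library(readstata13)

# Hypothetical labelled variable with a deliberate, non-alphabetical level order
df <- data.frame(
  id     = 1:3,
  health = factor(c("good", "poor", "fair"), levels = c("poor", "fair", "good"))
)

# Write a temporary Stata file so the example is self-contained
dta_path <- tempfile(fileext = ".dta")
save.dta13(df, dta_path)

# convert.factors = TRUE asks read.dta13 to turn Stata value labels into factors
back <- read.dta13(dta_path, convert.factors = TRUE)
levels(back$health)
```

Comparing `levels(back$health)` with the original level order is a quick way to check that the labels survived the round trip on your own files.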

Masood Sadat