What is the fastest way to read a data frame in R compared to Stata?

Question

I have a data set with 57,000 rows and 5,500 columns, containing both numeric and character variables. I originally downloaded the data in .dta format, and Stata reads it quite quickly: it takes 0.13 seconds when I time it using the timer command.

Now, I have been using R and, from what I have read, it is supposed to be much more efficient. I exported my data to CSV from Stata and, even following the recommendations I read on Stack Exchange, the results are not convincing.

Here is the best solution I came across:

library(data.table)
system.time(
  fread("~/Data/GSS/GSS.csv", stringsAsFactors = FALSE, header = TRUE,
        na.strings = paste0(".", letters), data.table = FALSE)
)

I get:

Read 57061 rows and 5548 (of 5548) columns from 1.053 GB file in 00:00:46
  user  system elapsed 
  52.000   1.492  53.470

I also get a lot of warnings regarding the missing values, although I have declared them. The warnings look like this:

Bumped column XXXX to type character on data row XXXX, field contains '.n'.

I think it has to do with R not recognizing these missing values in numeric columns.
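
To illustrate that suspicion, here is a minimal sketch using base R's read.csv on a made-up four-row sample (the column names and values are hypothetical, not taken from the GSS file). A column containing Stata-style missing codes such as .n or .d comes in as character unless every code is listed in na.strings. The plain "." code is added here as an assumption, since paste0(".", letters) alone does not cover it.

```r
# Hypothetical sample: 'age' is numeric apart from Stata-style missing codes.
csv_text <- "id,age\n1,34\n2,.n\n3,.d\n4,."

# Without declaring the codes, the whole column falls back to character.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Declaring "." plus ".a" ... ".z" keeps the column numeric, with NA for the codes.
stata_missing <- c(".", paste0(".", letters))
clean <- read.csv(text = csv_text, na.strings = stata_missing,
                  stringsAsFactors = FALSE)

str(raw$age)
str(clean$age)
```

The same vector can be passed to fread's na.strings argument, though how it treats numeric columns may depend on the installed data.table version.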

Any suggestions on how to improve this? As a side note, I tried sqldf but it just did not work on my computer, even after upgrading the package to the most recent version.

Here is the data I am working with: http://www3.norc.org/GSS+Website/Download/

Stata can read the `.dta` faster because that's a _binary_ format. It's almost always more efficient to read a binary file than a text file. An appropriate comparison would be against how quickly Stata reads the csv. Have you tried to read the `.dta` file in R with `foreign::read.dta`? — Joshua Ulrich, Oct 30 '14 at 16:03
But you said `fread` left many of the columns as character, likely because of Stata's multiple missing value types. Does `read.dta` handle them like you'd expect? `fast + not_what_you_want < slow + correct` — Joshua Ulrich, Oct 30 '14 at 16:34
Then just read it in once with `read.dta`, then save it to a `.Rdata` or `.rds` file and be done with it. :) — Joshua Ulrich, Oct 30 '14 at 16:49
So much faster now. Thanks for the clarification, I've learnt something today — Doon_Bogan, Oct 31 '14 at 08:53
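
The workflow suggested in the comments (parse the Stata file once, then cache it in R's own binary format) can be sketched as follows. This is a minimal illustration rather than a benchmark: the small data frame stands in for the result of foreign::read.dta, and the commented-out .dta path is hypothetical.

```r
# In practice the data frame would come from the one-off, slower import, e.g.:
# gss <- foreign::read.dta("~/Data/GSS/GSS.dta")   # hypothetical path
# A small stand-in is used here so the sketch runs on its own.
gss <- data.frame(id = 1:3,
                  age = c(34L, NA, 51L),
                  region = c("north", "south", NA),
                  stringsAsFactors = FALSE)

# Cache the parsed object in R's binary serialization format.
rds_path <- tempfile(fileext = ".rds")
saveRDS(gss, rds_path)

# Later sessions reload the cached object instead of re-parsing text.
gss_cached <- readRDS(rds_path)
identical(gss, gss_cached)
```

readRDS restores column types and NA values as they were saved, so the missing-value handling only has to be done correctly once.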

score 1 · Answer 1 · answered Apr 01 '18 at 04:42

I find read.dta13 from the readstata13 package to be the best for reading Stata files in general.

The main advantage of read.dta13 is its ability to read Stata's labelled data correctly into factors while keeping the order of the levels, which I find extremely important. It also reads files from any version of Stata, including Stata 15.

I could not do this using the haven and foreign packages.
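
A small round trip can illustrate the factor handling described above. This is a sketch under assumptions: it requires the readstata13 package to be installed, the data frame and its column names are made up, and the file is written with the package's own save.dta13 rather than by Stata itself.

```r
library(readstata13)  # third-party package; assumed to be installed

# Hypothetical data: 'edu' is a factor whose level order is not alphabetical.
survey <- data.frame(
  id  = 1:4,
  edu = factor(c("mid", "low", "high", "mid"), levels = c("low", "mid", "high"))
)

# Write a .dta file; factor levels are stored as Stata value labels.
dta_path <- tempfile(fileext = ".dta")
save.dta13(survey, dta_path)

# Read it back, converting labelled variables to factors.
survey_back <- read.dta13(dta_path, convert.factors = TRUE)

# Inspect the level order after the round trip.
levels(survey_back$edu)
```

For a real Stata export, the read.dta13 call would simply point at that .dta file instead of the temporary one.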
