I have a data set with 57000 rows and 5500 columns, containing both numeric and character variables. I originally downloaded the data in .dta format, and Stata reads it quite quickly: about 0.13 seconds when I time it with Stata's timer command.
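For reference, here is how one could time the same read in R directly from the .dta file, as a baseline for the comparison; a minimal sketch, assuming the haven package and that the .dta sits next to my CSV (the path is my guess):

library(haven)

# Baseline: read the original .dta directly in R and time it,
# analogous to timing the read in Stata with the timer command
system.time(
  gss_dta <- read_dta("~/Data/GSS/GSS.dta")
)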
Now, I have been using R and, from what I have read, it is supposed to be much more efficient. I exported my data to CSV from Stata, but even following the recommendations I read on Stack Exchange, the results are not convincing.
Here is the best solution I came across:
library(data.table)
system.time(
  fread("~/Data/GSS/GSS.csv", header = TRUE, stringsAsFactors = FALSE,
        na.strings = paste0(".", letters),  # Stata extended missing codes .a-.z
        data.table = FALSE)
)
I get:
Read 57061 rows and 5548 (of 5548) columns from 1.053 GB file in 00:00:46
   user  system elapsed 
 52.000   1.492  53.470
I also get a lot of warnings about the missing values, even though I have declared them via na.strings. The warnings look like:
Bumped column XXXX to type character on data row XXXX, field contains '.n'.
I think it has to do with fread not recognizing these missing-value codes in numeric columns.
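To show what I am declaring, here is what the na.strings argument expands to; note that plain ".", Stata's system missing, is not in the list, which might be part of the problem:

# Stata has 27 missing codes: system missing "." plus extended ".a" through ".z";
# paste0(".", letters) covers only the 26 extended codes
na_codes <- paste0(".", letters)
head(na_codes)      # ".a" ".b" ".c" ".d" ".e" ".f"
c(".", na_codes)    # would also cover system missing "."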
Any suggestions on how to improve this? As a side note, I also tried sqldf, but it just did not work on my computer, even after upgrading the package to the most recent version.
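For completeness, this is roughly what I tried with sqldf, using its read.csv.sql interface (a sketch from memory):

library(sqldf)

# read.csv.sql loads the CSV into a temporary SQLite database
# and returns the result of the query as a data frame
gss <- read.csv.sql("~/Data/GSS/GSS.csv",
                    sql = "select * from file",
                    header = TRUE, sep = ",")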
Here is the data I am working with: http://www3.norc.org/GSS+Website/Download/