4

Total newbie to R, have just spent a couple of hours playing and thought I'd have a play with some of the NHANES datasets e.g. ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2003-2004/

So grabbed a couple and after a play with merge(bmx_c, demo_c) and a quick Google I thought the sqldf library would be a more efficient way to merge / extract just a few columns from the files, to play with, but I've hit a problem.

Error in match.fun(asfn) : 
  'c("as.labelled", "as.integer")' is not a function, character or symbol**

The NHANES files are in SAS format so I had to:

install.packages("Hmisc")
install.packages("sqldf")

library(Hmisc)
library(sqldf)

demo_c <- sasxport.get("DEMO_C.XPT")
bmx_c <- sasxport.get("BMX_C.XPT")

is.data.frame(demo_c)
[1] TRUE

sqldf("select seqn from demo_c")
Error in match.fun(asfn) : 
  'c("as.labelled", "as.integer")' is not a function, character or symbol
>  

summary(demo_c$seqn)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21000   23540   26070   26070   28600   31130 
> 

I'm guessing some type conversion is required, but I don't know the subtleties of R.

Alex A.
  • 5,466
  • 4
  • 26
  • 56
arober11
  • 1,969
  • 18
  • 31
  • I think that because you are new to R you may have thought that the `sqldf` package would be faster (but in reality also probably because it made more sense? I remember being new to R and coming at it from a SAS perspective myself), but in fact, this might be easier to do in R once you get the data in rather than relying on a 3rd party package. What are you trying to produce as an output? – TARehman May 06 '15 at 17:39
  • @BondedDust: Quoting **seqn**, gives me the literal **seqn** 10k times in the output, not the value e.g. 9998 seqn 9999 seqn 10000 seqn [ reached getOption("max.print") -- omitted 122 rows ] – arober11 May 06 '15 at 17:42
  • Can you just do `demo_c[["seqn"]]` or `demo_c$seqn`? Why are you using `sqldf` like that to begin with? – Alex A. May 06 '15 at 17:42
  • Out of curiosity I was just going to have a quick stab at reproducing the results in this [paper](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0039504), so to start, pull in the relevant columns e.g **year1 <- sqldf("SELECT 'seqn', 'bmxwt', 'bmxht', 'bmxbmi', 'bmxwaist', 'RIAGENDR', 'RIDAGEYR', 'RIDAGEEX', 'RIDRETH1' FROM demo_a JOIN bmx_a USING(seqn) WHERE RIDEXPRG!=1 AND RIDAGEYR > 17")** – arober11 May 06 '15 at 17:49
  • Then work or a few percentiles e.g **quantile(scale(year1$bmxbmi), seq(0, 1, 0.01), na.rm=TRUE)** – arober11 May 06 '15 at 17:52
  • Thank you for that citation on a "new BMI". I had done an analysis of a much larger dataset using insured lives mortality, and similarly found that the power of 2 in the BMI denominator was not as good a fit as a fractional exponent. I don't think it was the same as the one in that paper, but it might be fun to publish. – IRTFM May 06 '15 at 18:00

1 Answers1

4

sqldf does not support the "labelled" column class produced by Hmisc. All the columns seem to be integer or numeric so convert the columns to numeric first:

demo_c[] <- lapply(demo_c, as.numeric)
sqldf("select seqn from demo_c")

You could convert the integer ones to integer if you prefer:

isInt <- sapply(demo_c, inherits, "integer")
demo_c[isInt] <- lapply(demo_c[isInt], as.integer)    
demo_c[!isInt] <- lapply(demo_c[!isInt], as.numeric)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • 1
    Though all the columns don't appear to be integers. Perhaps `demo_c[] <- lapply(demo_c, as.integer)` would be a safer choice to remove the "labelled" class. – MrFlick May 06 '15 at 17:46
  • @G.Grthendieck Could `sqldf` check for inheritance of an atomic class before attempting to coerce to a function? I think that error is being generated after the `colClass` function appends the "as." and then tries to run as a function. – IRTFM May 06 '15 at 17:54
  • Have converted them all to numeric. – G. Grothendieck May 06 '15 at 17:55
  • 1
    Yes, sqldf could be more sophiticated in the presence of multiple class objects. Not sure if there would be a significant performance penalty. I have added it as an issue in github. – G. Grothendieck May 06 '15 at 23:15