3

Today I've found an issue I cannot explain. Is this a well-known behavior?

Dataset:

structure(list(Original.Unit = c("some unit", "some unit", "some unit", 
"some unit", "some unit", "some unit"), Result = c(24, 28, NA, 
4.1, 4.5, 2.6), Conversion.Factor = c(1, 1.54, 1, 2.2, 1, 1)), .Names = c("Original.Unit", 
"Result", "Conversion.Factor"), row.names = c(NA, 6L), class = "data.frame")

Code:

> require(sqldf)

> (data <- dget("file"))   # "file" contains the above structure
  Original.Unit Result Conversion.Factor
1     some unit   24.0              1.00
2     some unit   28.0              1.54
3     some unit     NA              1.00
4     some unit    4.1              2.20
5     some unit    4.5              1.00
6     some unit    2.6              1.00

> sapply(data, function(d) { class(d)})
    Original.Unit            Result Conversion.Factor 
      "character"         "numeric"         "numeric" 

Let's query it this way:

> (result <- sqldf("SELECT `Original.Unit`, Result, `Conversion.Factor`, Result * `Conversion.Factor` AS ConvResult FROM data"))
  Original.Unit Result Conversion.Factor ConvResult
1     some unit   24.0              1.00      24.00
2     some unit   28.0              1.54      43.12
3     some unit     NA              1.00         NA
4     some unit    4.1              2.20       9.02
5     some unit    4.5              1.00       4.50
6     some unit    2.6              1.00       2.60

> sapply(result, function(r) { class(r)})
    Original.Unit            Result Conversion.Factor        ConvResult 
      "character"         "numeric"         "numeric"         "numeric" 

As far as good. Now let's sort the result by the last column:

> (result <- sqldf("SELECT `Original.Unit`, Result, `Conversion.Factor`, Result * `Conversion.Factor` AS ConvResult FROM data ORDER BY ConvResult"))
  Original.Unit Result Conversion.Factor ConvResult
1     some unit     NA              1.00       <NA>
2     some unit    2.6              1.00        2.6
3     some unit    4.5              1.00        4.5
4     some unit    4.1              2.20       9.02
5     some unit   24.0              1.00       24.0
6     some unit   28.0              1.54      43.12

And look at column types:

> sapply(result, function(r) { class(r)})
    Original.Unit            Result Conversion.Factor        ConvResult 
      "character"         "numeric"         "numeric"       "character" 

Why the ConvResult column is now of type character? Is this because of NA?

It seems that's the point. When I replaced NA with, say, 1000, ConvResult became numeric. But why does it happen?

Bastian
  • 313
  • 1
  • 13
  • Workaround, order the result outside sqldf. `result <- result[order(result$ConvResult),]` – zx8754 Sep 24 '15 at 13:46
  • 2
    SQLite uses the first row to get the type information and in this case the first row of `ConvResult` is NA so it can't tell. You could try your query using the H2 backend: `library(RH2); sqldf("...")` – G. Grothendieck Sep 25 '15 at 00:58

1 Answers1

3

Yes, seems you are right. When using ORDER BY sqldf tends to guess class of output column and in this case it makes a mistake. So, I guess you could set column types yourself just to be sure:

  result <- sqldf("SELECT `Original.Unit`, Result, `Conversion.Factor`, Result * `Conversion.Factor` AS ConvResult FROM data ORDER BY ConvResult",method = c("character", "numeric", "numeric", "numeric"))
milos.ai
  • 3,882
  • 7
  • 31
  • 33
  • Thank you for the explanation! sqldf exposes even more issues. It loses encoding (or rather encodes literals from UTF8 to a local encoding), sometimes loses POSIXct class and more. But I love it anyway :) Since I've created function sqldfEx which "wraps" sqldf and immediately fixes many "issues", i will update it with this one. Thanks! – Bastian Sep 24 '15 at 16:03