5

If I try to get an average of c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10) using AVG from SQL, I get a value of 5.2, instead of the expected 6.5.

# prepare data and write to file
write.table(data.frame(col1 = c(NA, NA, 3:10)),
        "my.na.txt", row.names = FALSE)

mean(c(NA, NA, 3:10), na.rm = TRUE) # 6.5

my.na <- read.csv.sql("my.na.txt", sep = " ",
        sql = "SELECT AVG(col1) FROM file") # 5.2

# this is identical to
sum(3:10)/10

unlink("my.na.txt") # remove file

Which leads me to believe that sql(df) treats NA values as zero. Is it possible to ignore (exclude) NA values in an SQL call as it can be done using na.rm argument (in R)?

Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
  • 1
    `?sqldf` says: "The `dbWriteTable/sqliteImportFile` routines that `sqldf` uses to transfer files to the data base are intended for speed and they are not as flexible as `read.table`." Although perhaps not the ideal solution, the `filter` argument of `read.csv.sql` can be used to filter out the `NA` lines before they are read into SQLite. There are some examples of the `filter` argument in `?sqldf` and on the sqldf home page: http://sqldf.googlecode.com . – G. Grothendieck Jan 14 '12 at 05:38

3 Answers3

7

Modify your query to ignore the NA values:

SELECT AVG(col1)
FROM file
WHERE col1 IS NOT \"NA\"
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • Hum, I'm using `SELECT AVG(col1) FROM file WHERE col1 IS NOT NULL` and still get 5.2. – Roman Luštrik Jan 14 '12 at 00:51
  • 1
    are the values in the DB N/A or NULL? (select col1 from file where col1 is not null) [do the NA's show up?] or (Select Col1 from file where col1 <> 'NA' [do the NA's show up?] – xQbert Jan 14 '12 at 00:56
  • Using `SELECT AVG(col1) FROM file WHERE col1 IS NOT \"NA\"` works fine and dandy. The trick is that NAs are quoted, and you need to escape them. Thank you both for the tip. – Roman Luštrik Jan 14 '12 at 01:00
2

The problem is that the read.csv.sql function does not recognize the missing values, and converts them to zero, instead of NULL. This does not happen if you first load the data into a data.frame, and only then call sqldf.

d <- read.csv("my.na.txt")
sqldf("SELECT AVG(col1) FROM d") # 6.5
Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
0

For future reference. I found this thread when researching the issue, a solution can be found in the definition of NA. When transferred to SQLite, NAs are converted to NULL, and then you can use IS NULL or IS NOT NULL to exclude them.

Fredrik Karlsson
  • 485
  • 8
  • 21