
My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are caught by functions like `is.na()`?

I am using sqldf to query the database.

Reproducible example:

vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")


df <- data.frame(vect1, vect2, vect3)
jgozal
  • The best way would be to deal with these when you are reading in the data. If you are reading in with the standard tools, have a look at the `na.strings` argument (see the sketch after these comments). Otherwise [this](http://stackoverflow.com/questions/9351089/replacing-missing-values-coded-by-in-an-r-dataframe) should help, but replace the period with NA – user20650 Jan 02 '16 at 14:33
  • What are considered the standard tools? I'm reading all data from a SQL database, which might make it hard to deal with these when reading the data – jgozal Jan 02 '16 at 14:34
  • Okay, what function are you using specifically? Pretty sure they will have an `na.strings` argument – user20650 Jan 02 '16 at 14:36
  • Yes, I don't think `na.strings` would work in my situation, unfortunately – jgozal Jan 02 '16 at 14:36
  • I am using sqldf to query the database – jgozal Jan 02 '16 at 14:37
  • Hmmm, not so sure then. But the link in the comment should help to do this after reading in – user20650 Jan 02 '16 at 14:40
  • Thank you for adding that in. I can't seem to find an `na.strings` equivalent for sqldf. What's so wrong about parsing it after reading the data, though? – jgozal Jan 02 '16 at 14:43
  • For me, mainly it is just easier. But (I think) when you make these replacements, copies of the data are made, which can be a problem for large data – user20650 Jan 02 '16 at 14:46
  • I agree. I could write a for loop that went through all rows and converted all "NA" strings to real NAs, but that would take a very long time since my dataframe is about 5 million rows and about 250 variables, hence the "most efficient" in my question – jgozal Jan 02 '16 at 14:46
  • Building on @user20650's suggestion, you can also use `type.convert` after having read the data in, as I did [in my answer](http://stackoverflow.com/a/34567882/1270695). – A5C1D2H2I1M1N2O1R2T1 Jan 02 '16 at 16:13
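
As a reference for the read-time route suggested in the comments, here is a minimal sketch using base R's `read.csv` (the file name is hypothetical; `na.strings` is passed through to `read.table`):

# treat the literal string "NA" (and empty fields) as missing while reading
df <- read.csv("mydata.csv", na.strings = c("NA", ""))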

3 Answers


To add to the alternatives, you can also use `replace` instead of the typical `blah[index] <- NA` approach. `replace` would look like:

df <- replace(df, df == "NA", NA)

Another alternative to consider is `type.convert`. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.

df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
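
As a standalone illustration of the coercion (recent versions of R warn unless `as.is` is given explicitly; `as.is = TRUE` keeps character input as character instead of converting to factor):

type.convert(c("1", "5", "NA"), na.strings = "NA", as.is = TRUE)
# [1]  1  5 NA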

Here's a performance comparison. The sample data is from @Roland's answer below.

Here are the functions to test:

funop <- function() {
  df[df == "NA"] <- NA
  df
}

funr <- function() {
  ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
    is.na(x) <- x == "NA"
    x
  }), .SDcols = ind][]
}

funam1 <- function() replace(df, df == "NA", NA)

funam2 <- function() {
  df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
  df
}

Here's the benchmarking:

library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
#   funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287    10
#    funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837    10
#  funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706    10
#  funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258    10

`replace` would be the same as @Roland's approach, which is the same as @jgozal's. However, the `type.convert` approach would result in different column types.

all.equal(funop(), setDF(funr()))
all.equal(funop(), funam1())

str(funop())
# 'data.frame': 10000000 obs. of  3 variables:
#  $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
#  $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
#  $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...

str(funam2())
# 'data.frame': 10000000 obs. of  3 variables:
#  $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
#  $ vect2: int  NA 5 1 NA 1 NA NA 1 NA 5 ...
#  $ vect3: logi  NA NA NA NA NA NA ...
A5C1D2H2I1M1N2O1R2T1

I found this nice way of doing it in another question; for this particular situation it would just be:

df[df == "NA"] <- NA

It only took about 30 seconds with 5 million rows and ~250 variables.
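
One caveat: the example columns are factors (`data.frame` defaulted to `stringsAsFactors = TRUE` in R versions before 4.0), and assigning NA does not remove the now-unused "NA" factor level. A `droplevels` pass clears it:

df[df == "NA"] <- NA
# drop the leftover "NA" level from all factor columns
df <- droplevels(df)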

jgozal
  • You can do this much faster, e.g., using package data.table. But you don't provide a reproducible example of `df`, which precludes me from composing and testing/benchmarking an answer. – Roland Jan 02 '16 at 14:58
  • reproducible example added – jgozal Jan 02 '16 at 15:04

This is slightly faster:

set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df == "NA"] <- NA)
#  user  system elapsed 
# 3.601   0.378   3.984 

library(data.table)
setDT(df2)
system.time({
  # find character and factor columns
  ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  # assign by reference
  df2[, names(df2)[ind] := lapply(.SD, function(x) {
    is.na(x) <- x == "NA"
    x
  }), .SDcols = ind]
})
#  user  system elapsed 
# 2.484   0.190   2.676 
all.equal(df, setDF(df2))
#[1] TRUE
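
For completeness, the same by-reference replacement can be written with data.table's `set` in a loop over the flagged columns (a sketch along the same lines, not benchmarked here):

library(data.table)
# assumes df2 is the data.table and ind the column indices found above
for (j in ind) {
  rows <- which(df2[[j]] == "NA")  # rows holding the literal string "NA"
  set(df2, i = rows, j = j, value = NA)
}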
Roland