2

I have a data.table containing two date variables. The data set was read into R from a .csv file (was originally an .xlsx file) as a data.frame and the two variables then converted to date format using as.Date() so that they display as below:

df
  id   specdate    recdate
1  1 2014-08-12 2014-08-17
2  2 2014-08-15 2014-08-20
3  3 2014-08-21 2014-08-26
4  4       <NA> 2014-08-28
5  5 2014-08-25 2014-08-30
6  6       <NA>       <NA>

I then converted the data.frame to a data.table:

df <- data.table(df)

I then wanted to create a third variable, that would include "specdate" if present, but replace it with "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty, as it seems that no matter how I approach this, data.table displays dates in date format only if a complete variable that is already in date format is copied. Otherwise, individual values are displayed as a number (even when using as.IDate) and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date but display the dates as dates in data.table?

Below is my attempt to fill the NAs of specdate with the recdate dates:

# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist=NULL) {
      nacol <- as.character(nacol)
      if(!is.null(replacelist)) nacol <- factor(ifelse(dataref==lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
      else nacol <- factor(ifelse(dataref==lookupref & is.na(nacol), replacecol, nacol))
      nacol                
    }

# Fill the NAs in specdate with the function:
    df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]

Here is what happens:

    > df
   id   specdate    recdate  finaldate
1:  1 2014-08-12 2014-08-17 2014-08-12
2:  2 2014-08-15 2014-08-20 2014-08-15
3:  3 2014-08-21 2014-08-26 2014-08-21
4:  4       <NA> 2014-08-28      16310
5:  5 2014-08-25 2014-08-30 2014-08-25
6:  6       <NA>       <NA>         NA

The display problem is compounded if I create the new variable from scratch by using ifelse:

df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]

This gives:

> df
   id   specdate    recdate finaldate
1:  1 2014-08-12 2014-08-17     16294
2:  2 2014-08-15 2014-08-20     16297
3:  3 2014-08-21 2014-08-26     16303
4:  4       <NA> 2014-08-28     16310
5:  5 2014-08-25 2014-08-30     16307
6:  6       <NA>       <NA>        NA

Alternately if I try a find-and-replace type approach, I get an error about the number of items to replace not matching the replacement length (I'm guessing this is because that approach is not vectorised?), the values from recdate are recycled and end up in the wrong place:

> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
  number of items to replace is not a multiple of replacement length
> df
   id   specdate    recdate  finaldate
1:  1 2014-08-12 2014-08-17 2014-08-12
2:  2 2014-08-15 2014-08-20 2014-08-15
3:  3 2014-08-21 2014-08-26 2014-08-21
4:  4       <NA> 2014-08-28 2014-08-17
5:  5 2014-08-25 2014-08-30 2014-08-25
6:  6       <NA>       <NA> 2014-08-20

So in conclusion - the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not in date format. Once displayed as a number, the origin is required to again display it as a date (and I would like to avoid supplying the origin since I usually don't know it and it seems unnecessarily repetitive to have to supply it when the date was originally in the correct format).

Any insights as to where I'm going wrong would be much appreciated.

Amy M
  • 967
  • 1
  • 9
  • 19
  • 1
    `ifelse` [strips attributes](http://stackoverflow.com/questions/31133382/ifelse-stripping-posixct-attribute-from-vector-of-timestamps) – eddi Oct 08 '15 at 20:48

2 Answers2

3

I'd approach it like this, maybe :

DT <- data.table(df)
DT[, finaldate := specdata]
DT[is.na(specdata), finaldate := recdate]

It seems you want to add a new column so you can can retain the original columns as well. I do that as well a lot. Sometimes, I just update in place :

DT <- data.table(df)
DT[!is.na(specdate), specdate:=recdate]
setnames(DT, "specdate", "finaldate")

Using i like that avoids creating a whole new column's worth of data which might be very large. Depends on how important retaining the original columns is to you and how many of them there are and your data size. (Note that a whole column's worth of data is still created by the is.na() call and then again by ! but at least there isn't a third column's worth for the new finaldate. Would be great to optimize i=!is.na() in future (#1386) and if you use data.table this way now you won't need to change your code in future to benefit.)

It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From README :

  • correctly handles na.strings argument for all types of columns - it detect possible NA values without coercion to character, like in base read.table. fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.

Btw, you've made one of the top 3 mistakes mentioned here : https://github.com/Rdatatable/data.table/wiki/Support

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Thanks @ Matt: my data set is relatively small (< 1000 records) and I need to know which date "finaldate" comes from for some later summaries, so your first suggestion works best for me - but I can see the value of the second when working with larger data sets. So my third approach was on the right track, but my error was treating it as a data.frame instead of coding it the data.table way - would like to understand a bit better the difference in behavior though. I admit I didn't search for the error my 3rd approach resulted in - apologies for that. – Amy M Oct 09 '15 at 10:02
  • @AmyM Ok great. (It was to please state the version number - common mistake #3. See how Pierre said it worked for him so maybe simply upgrading to the latest version would have fixed it for you but we didn't know what version you were using). – Matt Dowle Oct 09 '15 at 12:19
  • Ah, it was R v 3.2.1 with data.table 1.9.4, so probably overdue an upgrade. – Amy M Oct 09 '15 at 13:05
  • @AmyM Yes for sure upgrade to v1.9.6 - fixes many bugs. Not really overdue as only recently on CRAN. – Matt Dowle Oct 09 '15 at 16:26
1

Works for me. You may want to test to be sure that your NA values are not strings or factors "<NA>"; they will look like real NA values:

dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
  ,finaldate := as.POSIXct(finaldate*86400, origin="1970-01-01", tz="UTC")]
#    id   specdate    recdate  finaldate
# 1:  1 2014-08-12 2014-08-17 2014-08-12
# 2:  2 2014-08-15 2014-08-20 2014-08-15
# 3:  3 2014-08-21 2014-08-26 2014-08-21
# 4:  4         NA 2014-08-28 2014-08-28
# 5:  5 2014-08-25 2014-08-30 2014-08-25
# 6:  6         NA         NA         NA

Data

df <- read.table(text="  id   specdate    recdate
1  1 2014-08-12 2014-08-17
2  2 2014-08-15 2014-08-20
3  3 2014-08-21 2014-08-26
4  4         NA 2014-08-28
5  5 2014-08-25 2014-08-30
6  6         NA         NA", header=T, stringsAsFactors=F)

dt <- as.data.table(df)
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • The title mentions retaining the date format, but the data set you're working with has character cols instead. Not sure if it matters for the substance of your answer. – Frank Oct 08 '15 at 20:00
  • Thanks - the NA values should be real in both the example dataset I made and my real one (blank cells in the .csv file). It might be that I need to upgrade as @MattDowle has suggested. – Amy M Oct 09 '15 at 13:10