0

I am having problems merging / joining data for the coming daylight savings shift. My time-vector d is supposed to be the controlling time-vector, so when I join with data with missing holes I just get NA values. This normally works brillantly. However, during the coming '2015-10-25 02:00:00' it goes horribly wrong.

Data example:

d <- seq.POSIXt(from = as.POSIXct("2015-10-25 00:00:00", "%Y-%m-%d %H:%M:%S", tz = ""), 
                to = as.POSIXct("2015-10-25 23:00:00", "%Y-%m-%d %H:%M:%S", tz = ""), by = "hour")

df1 <- data.frame(Date = d, value1 = 1:25)

df2 <- data.frame(Date = as.POSIXct(format(d, "%Y-%m-%d %H:%M:%S"), tz = ""), value2 = 26:50)

require(dplyr)

df <- left_join(df1, df2, by = "Date")

df <- merge(df1, df2, by = "Date", all.x = TRUE)

Both left_join and merge gives wrong results, and I am not sure what goes wrong. Well, I can see R has no idea how to handle the two repeated hours - and that is completely understandable. Both time series are POSIXct, but there is clearly some information I am missing? How can you handle this? I would prefer a base R-solution.

It gets exponentially worse, if you need to do even more joins from different data-sets. I need to join 7 and it just gets worse and worse.

The correct result is:

result <- data.frame(Date = d, var1 = df1[, 2], var2 = df2[, 2])
nicola
  • 24,005
  • 3
  • 35
  • 56
Thorst
  • 1,590
  • 1
  • 21
  • 35
  • 1
    What's your expected output? You say that they "give wrong results", without specifying how the right result would look like. – nicola Oct 20 '15 at 12:39
  • Added expected output! :D – Thorst Oct 20 '15 at 12:50
  • As it stated, you are actually not joining tables, but just adding columns. You can do that through `cbind`. If the `Date` column is the same for every `data.frame`, just adding columns would be fine. – nicola Oct 20 '15 at 12:53
  • The `Data` column is not the same for every `data.frame`, which is also why I use a master-data-key as `d`. So, `cbind` doesn't work. I need to use a `left_join` variant which can handle the double hours. – Thorst Oct 20 '15 at 13:10
  • After giving a second look, the results obtained from `merge` and `left_join` are correct and your desired output is not. Check `as.numeric(df1$Date)` and `as.numeric(df2$Date)`. Do you see it? (Hint: summer time) – nicola Oct 20 '15 at 13:19
  • @nicola - No. You haven't understood it. df1 and df2 both have a length of 25 obs, because it is the daylight savings time. When you merge or join them, you get a new df - which is 26 long. No longer 25 long. I get that the two time series are different if you use as numeric, hence that is why I choose the precise daylight savings day.... My problem is merging them to a frame that is 25 long. – Thorst Oct 21 '15 at 06:08
  • I haven't been clear enough. During the time shift, you have twice the 2AM, but they correspond to different moments. The first 2AM is after the 1:59:59 AM and the second is after the 2:59:59. They are printed in the same way, but they are different datetimes (as shown by `as.numeric`). In `df1` both these datetimes are present, while in `df2` only the first one appears (twice). So, correctly, the first datetime is matched with both the lines of `df2`, while the second one with neither. Hope this time I have been clearer than before. – nicola Oct 21 '15 at 06:55
  • Yup, thanks for the clarification. But, it still doesn't solve the problem :) – Thorst Oct 21 '15 at 09:22
  • Your problem can't be solved if the situation stays as it is. Why `df2` as twice the same datetime? I guess that it's an error. You need to produce your data correctly to start with and then merge. – nicola Oct 21 '15 at 09:30
  • I am binding it from different sources, that are similiar to df2. The problem is that you can't convert character with DST to the correct daylight-savings posixct. See http://stackoverflow.com/questions/13156836/character-posixct-conversion-in-r-causes-wrong-timezone-values-on-daylight-savin?rq=1 – Thorst Oct 21 '15 at 10:10
  • Of course you cannot since they are ambiguous. However, if repeated datetimes occur only in this instance, you can: `sames<-c(FALSE,diff(df2$Date)==0);df2$Date[sames]<-df2$Date[sames] + 3600`, i.e. you manually add an hour to the second repeated datetime. – nicola Oct 21 '15 at 10:35

0 Answers0