left_join (dplyr) the next available date

Question

I have 2 datasets in "R".

The first DB contains specific dates:

    Value       Date   
#   20          2017-10-19 
#   19          2017-10-23 
#   19          2017-11-03 
#   20          2017-11-10

And the second contains the level of an stock index from the last 5 years

     Date       Index
#    2017-11-10 13.206,35
#    2017-11-03 13.378,96
#    2017-10-25 13.404,58
#    2017-10-19 13.517,98

Now I want to merge by searching for the dates from the first dataset "DB" and adding the correct Index value for this date from the second dataset "Hist".

What I did is using the left_join function:

DB <- left_join(DB, Hist, by = "Date")

The problem is some dates in the first dataset are public holidays where no data is available in the second dataset "Hist". So I have some "NA".

  Value   Date         Index
# 20      2017-10-19   13.517,98
# 19      2017-10-23   NA
# 19      2017-11-03   13.378,96
# 20      2017-11-10   13.206,35

What I'm looking for is to take the value of the next available date instead of adding NA.

Example: Instead of adding NA taking the index of 2017-10-25 (2 days later)

  Value   Date         Index
# 20      2017-10-19   13.517,98
# 19      2017-10-23   13.404,58
# 19      2017-11-03   13.378,96
# 20      2017-11-10   13.206,35

Has anybody an idea. Thanks in advance!

score 5 · Accepted Answer · answered Nov 11 '17 at 23:17

Original Request

The following is an option. It uses full_join, and then the fill function to impute the missing value.

library(tidyverse)

DB_final <- DB %>%
  full_join(Hist, by = "Date") %>%
  arrange(Date) %>%
  fill(Index, .direction = "up") %>%
  filter(!is.na(Value))
DB_final
#   Value       Date     Index
# 1    20 2017-10-19 13.517,98
# 2    19 2017-10-23 13.404,58
# 3    19 2017-11-03 13.378,96
# 4    20 2017-11-10 13.206,35

However, the user needs to know the fill direction (up or down) in advance. It may not be useful if the user does not know that.

Impute Missing Value based on the Nearest Date

Here is another option, which I think is more robust. It will impute the missing value use the Index from the nearest date.

Step 1: Find the Nearest Date

# Collect all dates
Date_vec <- sort(unique(c(DB$Date, Hist$Date)))

# Create a distance matrix based on dates than convert to a data frame
dt <- Date_vec %>%
  dist() %>%
  as.matrix() %>%
  as.data.frame() %>%
  rowid_to_column(var = "ID") %>%
  gather(ID2, Value, -ID) %>%
  mutate(ID2 = as.integer(ID2)) %>%
  filter(ID != ID2) %>%
  arrange(ID, Value) %>%
  group_by(ID) %>%
  slice(1) %>%
  select(-Value)

dt$ID <- Date_vec[dt$ID]
dt$ID2 <- Date_vec[dt$ID2]  

names(dt) <- c("Date1", "Date2")

dt
# # A tibble: 5 x 2
# # Groups:   ID [5]
#       Date1      Date2
#      <date>     <date>
# 1 2017-10-19 2017-10-23
# 2 2017-10-23 2017-10-25
# 3 2017-10-25 2017-10-23
# 4 2017-11-03 2017-11-10
# 5 2017-11-10 2017-11-03

dt shows the nearest date of all the dates.

Step 2: Perform multiple join

Join DB and dt, and then join Hist twice based on different date columns.

DB2 <- DB %>% left_join(dt, by = c("Date" = "Date1")) 

DB3 <- DB2 %>%
  left_join(Hist, by = "Date") %>%
  left_join(Hist, by = c("Date2" = "Date")) 
DB3
#   Value       Date      Date2   Index.x   Index.y
# 1    20 2017-10-19 2017-10-23 13.517,98      <NA>
# 2    19 2017-10-23 2017-10-25      <NA> 13.404,58
# 3    19 2017-11-03 2017-11-10 13.378,96 13.206,35
# 4    20 2017-11-10 2017-11-03 13.206,35 13.378,96

Step 3: Finalize the Index

If there are values in Index.x, use that, otherwise, use the values in Index.y.

DB4 <- DB3 %>% 
  mutate(Index = ifelse(is.na(Index.x), Index.y, Index.x)) %>%
  select(Value, Date, Index)
DB4
#   Value       Date     Index
# 1    20 2017-10-19 13.517,98
# 2    19 2017-10-23 13.404,58
# 3    19 2017-11-03 13.378,96
# 4    20 2017-11-10 13.206,35

DB4 is the final output.

DATA

DB <- structure(list(Value = c(20L, 19L, 19L, 20L), Date = structure(c(17458, 
                                                                       17462, 17473, 17480), class = "Date")), class = "data.frame", .Names = c("Value", 
                                                                                                                                                "Date"), row.names = c(NA, -4L))


Hist <- structure(list(Date = structure(c(17480, 17473, 17464, 17458), class = "Date"), 
                       Index = c("13.206,35", "13.378,96", "13.404,58", "13.517,98"
                       )), class = "data.frame", .Names = c("Date", "Index"), row.names = c(NA, 
                                                                                            -4L))

Thanks a lot. Already your first option fulfills my need since I know the direction. And this is pretty slim code. Great, thank you. — MelBourbon, Nov 12 '17 at 09:52

amarchin · Answer 2 · 2017-11-11T23:08:33.577

1

A solution could be

library(dplyr)
library(rlang)

clean_df <- function(df) {

  ix <- which(is.na(df$Index))
  df$Index[ix] <- df$Index[ix + 1]

  filter(df, !is.na(.data$Value))

}

full_join(DB, Hist) %>%
  arrange(Date) %>%
  clean_df()

edited Nov 11 '17 at 23:08

answered Nov 11 '17 at 22:54

amarchin

2,044
1
16
32

npa1994 · Answer 3 · 2017-11-12T00:46:11.253

What you have done, plus as.Date() to format dates:

library(data.table)
library(dplyr)

DB = data.table(
  Value = c(20,19,19,29),
  Date = c("2017-10-19","2017-10-23","2017-11-03","2017-11-10")
  )

Hist = data.table(
  Date = c("2017-11-10","2017-11-03","2017-10-25","2017-10-19"),
  Index = c("13.206,35","13.378,96","13.404,58","13.517,98")
  )

DB[, Date := as.Date(Date)]
Hist[, Date := as.Date(Date)]

DB <- left_join(DB,Hist,by="Date") %>% as.data.table()

Now perform the steps below:

# Get rows which are missing an Index.
DB_na <- DB[is.na(Index),]
DB <- DB[!is.na(Index),]

# Build function to find appropriate Index, given an na_date.
get_na_index <- function(na_date) {
  bigger_dates = DB[Date>na_date,]
  index = bigger_dates[which.min(other_dates-na_date), Index]
  return(index)
}

# Use apply() to perform row-wise operation.
DB_na$Index <- apply(matrix(DB_na$Date), 1, get_na_index)

# Combine rows
DB <- rbind(DB, DB_na) %>% arrange(Date)

The Output:

DB

  Value       Date     Index
1    20 2017-10-19 13.517,98
2    19 2017-10-23 13.378,96
3    19 2017-11-03 13.378,96
4    29 2017-11-10 13.206,35

Thanks, this works but is not solving my problem. Because it seems that it takes the next index value of the joint "DB". I want to search for the next Index value in "Hist" if NA. — MelBourbon, Nov 12 '17 at 09:29