Group two dfs based on dates that closely match

Question

These are subsets of two dataframes.

df1:

plot	mean_first_flower_date	gdd
1	2019-07-15	60
1	2019-07-21	50
1	2019-07-23	78
2	2019-05-13	100
2	2019-05-22	173
2	2019-05-25	245

(cont.)

df2:

plot	date	flowers
1	2019-07-12	2
1	2019-07-13	9
1	2019-07-14	3
1	2019-07-15	3
2	2019-05-12	10
2	2019-05-13	10
2	2019-05-14	14
2	2019-05-15	17

(cont.)

df2 has some matching dates with df1 but sometimes the dates are off for one or a couple days (highlighted in bold).

I would like to group both dfs based on both 'date' and 'plot', keeping df2, without losing 'gdd' data from df1.

This will happen if, for example, I inner_join both dfs because the dates will not match.

So if a date in df1 is one to three days earlier or later than what it's possible to match in df2, it's fine because the dates are relatively close. This is tricky because I want this data replacement only if there is not data available in df1 for that data range.

My goal is to have something like this:

plot	date	flowers	gdd
1	2019-07-12	2	60
1	2019-07-13	9	60
1	2019-07-14	3	60
1	2019-07-15	3	60
2	2019-05-12	10	100
2	2019-05-13	10	100
2	2019-05-14	14	100
2	2019-05-15	17	100

Is it possible to do?

I greatly appreciate any help! Thanks!

Should the dates for plot==2 in `df1` be in May instead of July? — thelatemail, Feb 24 '21 at 01:06
You want to fill `NA`s in resulting join, but what's the criteria for filling/adopting in case where there are two or more values in one plot? — AnilGoyal, Feb 24 '21 at 04:44

score 1 · Answer 1 · answered Feb 24 '21 at 01:29

1

I think a 'rolling join' from the data.table package can handle this:

library(data.table)
setDT(df1)
setDT(df2)

df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]

df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]

#   plot mean_first_flower_date gdd flowers
#1:    1             2019-07-12  60       2
#2:    1             2019-07-13  60       9
#3:    1             2019-07-14  60       3
#4:    1             2019-07-15  60       3
#5:    2             2019-05-12 100      10
#6:    2             2019-05-13 100      10
#7:    2             2019-05-14 100      14
#8:    2             2019-05-15 100      17

Using this data:

df1 <- read.table(text="plot  mean_first_flower_date  gdd
1  2019-07-15  60
1  2019-07-21  50
1  2019-07-23  78
2  2019-05-13  100
2  2019-05-22  173
2  2019-05-25  245", header=TRUE)

df2 <- read.table(text="plot  date  flowers
1  2019-07-12  2
1  2019-07-13  9
1  2019-07-14  3
1  2019-07-15  3
2  2019-05-12  10
2  2019-05-13  10
2  2019-05-14  14
2  2019-05-15  17", header=TRUE)

answered Feb 24 '21 at 01:29

thelatemail

91,185
12
128
188

I got this message after running the last line in the code:Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 33674 rows; more than 31079 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. – user3508884 Feb 24 '21 at 02:23
1

Try adding the suggested argument to the call - `df1[df2, ..., allow.cartesian=TRUE]` as the message says. Or *Check for duplicate key values* in the `mean_first_flower_date` - if you have many dates matching to many +3/-3 date ranges you will get duplication. – thelatemail Feb 24 '21 at 02:28
@user3508884 Please share some rows where more than one row matches in one plot. In the sample data shared, there is only one match per plot. – AnilGoyal Feb 24 '21 at 07:32

score 1 · Answer 2 · answered Feb 24 '21 at 07:46

Try fill from dplyr. use this syntax

df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up")

  plot       date flowers gdd
1    1 2019-07-12       2  60
2    1 2019-07-13       9  60
3    1 2019-07-14       3  60
4    1 2019-07-15       3  60
5    2 2019-05-12      10 100
6    2 2019-05-13      10 100
7    2 2019-05-14      14  NA
8    2 2019-05-15      17  NA

As you can notice there are two NAs in the last two rows which shouldn't be there if you'll join your actual df2 where these rows will be filled by 173 as there will be a match for 2019-05-22. Still if you want to fill the last NA rows, if any, you can use fill again with .direction = "down"

df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")

  plot       date flowers gdd
1    1 2019-07-12       2  60
2    1 2019-07-13       9  60
3    1 2019-07-14       3  60
4    1 2019-07-15       3  60
5    2 2019-05-12      10 100
6    2 2019-05-13      10 100
7    2 2019-05-14      14 100
8    2 2019-05-15      17 100

Group two dfs based on dates that closely match

2 Answers2