1

These are subsets of two dataframes.

df1:

plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245

(cont.)

df2:

plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17

(cont.)

df2 has some matching dates with df1 but sometimes the dates are off for one or a couple days (highlighted in bold).

I would like to group both dfs based on both 'date' and 'plot', keeping df2, without losing 'gdd' data from df1.

This will happen if, for example, I inner_join both dfs because the dates will not match.

So if a date in df1 is one to three days earlier or later than what it's possible to match in df2, it's fine because the dates are relatively close. This is tricky because I want this data replacement only if there is not data available in df1 for that data range.

My goal is to have something like this:

plot date flowers gdd
1 2019-07-12 2 60
1 2019-07-13 9 60
1 2019-07-14 3 60
1 2019-07-15 3 60
2 2019-05-12 10 100
2 2019-05-13 10 100
2 2019-05-14 14 100
2 2019-05-15 17 100

Is it possible to do?

I greatly appreciate any help! Thanks!

user3508884
  • 75
  • 1
  • 1
  • 8

2 Answers2

1

I think a 'rolling join' from the data.table package can handle this:

library(data.table)
setDT(df1)
setDT(df2)

df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]

df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]

#   plot mean_first_flower_date gdd flowers
#1:    1             2019-07-12  60       2
#2:    1             2019-07-13  60       9
#3:    1             2019-07-14  60       3
#4:    1             2019-07-15  60       3
#5:    2             2019-05-12 100      10
#6:    2             2019-05-13 100      10
#7:    2             2019-05-14 100      14
#8:    2             2019-05-15 100      17

Using this data:

df1 <- read.table(text="plot  mean_first_flower_date  gdd
1  2019-07-15  60
1  2019-07-21  50
1  2019-07-23  78
2  2019-05-13  100
2  2019-05-22  173
2  2019-05-25  245", header=TRUE)

df2 <- read.table(text="plot  date  flowers
1  2019-07-12  2
1  2019-07-13  9
1  2019-07-14  3
1  2019-07-15  3
2  2019-05-12  10
2  2019-05-13  10
2  2019-05-14  14
2  2019-05-15  17", header=TRUE)
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • I got this message after running the last line in the code:Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 33674 rows; more than 31079 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. – user3508884 Feb 24 '21 at 02:23
  • 1
    Try adding the suggested argument to the call - `df1[df2, ..., allow.cartesian=TRUE]` as the message says. Or *Check for duplicate key values* in the `mean_first_flower_date` - if you have many dates matching to many +3/-3 date ranges you will get duplication. – thelatemail Feb 24 '21 at 02:28
  • @user3508884 Please share some rows where more than one row matches in one plot. In the sample data shared, there is only one match per plot. – AnilGoyal Feb 24 '21 at 07:32
1

Try fill from dplyr. use this syntax

df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up")

  plot       date flowers gdd
1    1 2019-07-12       2  60
2    1 2019-07-13       9  60
3    1 2019-07-14       3  60
4    1 2019-07-15       3  60
5    2 2019-05-12      10 100
6    2 2019-05-13      10 100
7    2 2019-05-14      14  NA
8    2 2019-05-15      17  NA

As you can notice there are two NAs in the last two rows which shouldn't be there if you'll join your actual df2 where these rows will be filled by 173 as there will be a match for 2019-05-22. Still if you want to fill the last NA rows, if any, you can use fill again with .direction = "down"

df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
  fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")

  plot       date flowers gdd
1    1 2019-07-12       2  60
2    1 2019-07-13       9  60
3    1 2019-07-14       3  60
4    1 2019-07-15       3  60
5    2 2019-05-12      10 100
6    2 2019-05-13      10 100
7    2 2019-05-14      14 100
8    2 2019-05-15      17 100
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45