1

I am attempting to inner_join two data frames, each with three columns. The first data frame contains date, variable names, and forecast values while the second data frame contains date, variable names, and actual values. The purpose of my join is to match forecast values to actual values by date and the correct weather variable name for accuracy analysis.

My first data frame x is as follows

 x <- structure(list(Date = structure(c(1588060800, 1588060800, 1588060800, 
1588060800, 1588060800, 1588060800, 1588060800, 1588060800, 1588060800, 
1588060800, 1588060800, 1588060800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), wx_vars = c("Wx1_Temperature", "Wx1_Precipitation", 
"Wx1_CloudCover", "Wx1_DewPoint", "Wx1_WindSpeed", "Wx1_SolarRadiation", 
"Wx2_Temperature", "Wx2_Precipitation", "Wx2_CloudCover", "Wx2_DewPoint", 
"Wx2_WindSpeed", "Wx2_SolarRadiation"), wx_forecast = c(56.92, 
0.0046, 77.46, 50.26, 7.42, 12.93, 57.05, 0.0037, 68.3, 50.5, 
7.32, 19.02)), row.names = c(NA, 12L), class = "data.frame")

Data Frame x

My second frame y is as follows:

y <- structure(list(Date = structure(c(1588057200, 1588057200, 1588057200, 
1588057200, 1588060800, 1588060800, 1588060800, 1588060800, 1588060800, 
1588060800, 1588064400, 1588064400), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), wx_vars = c("Actual_CloudCover", "Actual_WindSpeed", 
"Actual_Precipitation", "Actual_SolarRadiation", "Actual_Temperature", 
"Actual_DewPoint", "Actual_CloudCover", "Actual_WindSpeed", "Actual_Precipitation", 
"Actual_SolarRadiation", "Actual_Temperature", "Actual_DewPoint"
), wx_actuals = c(54.8, 5.63, 0, 26.1, 57.32, 49.99, 61, 7.24, 
0.00015, 23.4, 59.84, 52.11)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -12L))

data frame y

As you can see, the x forecast date frame has two separate weather forecasts that I want to evaluate the accuracy of: Wx1 and Wx2. Each row has a different weather variable listed after Wx1 or Wx2 which line up exactly with the actual data in data frame y. The Dates are in the same format in both x and y and would work in a simple inner_join, but given the string differences in the wx_vars column of both data frame x and y, I've been attempting to work with fuzzyjoin. No luck yet.

Here is what I've tried, and my biggest issue is what I included as my question title. I have yet to find an example of a fuzzyjoin on two columns where one column match would work in a regular join while the other column match requires fuzzyjoin, in my case partial string matching.

wx_analysis_1<- fuzzy_inner_join(x, y, by = c("Date", "wx_vars"="wx_vars"), match_fun = str_detect)
wx_analysis_2 <- regex_inner_join(x, y, by = c("Date", "wx_vars"="wx_vars"))

No luck with either of these solutions. Is there some syntax I'm missing to handle the date join better? I feel like fuzzy_inner_join by wx_vars should work, but the Date column might be my issue.

Thanks

user3720887
  • 719
  • 1
  • 11
  • 18
  • you'll need to do something like match_fun = list(`==`, stringdist) BUT Bas is right - no need for fuzzyjoin if you just do some text manipulation and separate your variable! – Arthur Yip Oct 20 '20 at 07:42

1 Answers1

2

Since you know the rules by which the columns should match, it is easier to use those explicitly:

x <- x %>% 
  separate(wx_vars, c("type_wx", "wx_vars"), "_")

y <- y %>% 
  separate(wx_vars, c("type", "wx_vars"), "_")


x %>% 
  inner_join(y, by = c("Date", "wx_vars"))

#                   Date type_wx        wx_vars wx_forecast   type wx_actuals
# 1  2020-04-28 08:00:00     Wx1    Temperature     56.9200 Actual   57.32000
# 2  2020-04-28 08:00:00     Wx1  Precipitation      0.0046 Actual    0.00015
# 3  2020-04-28 08:00:00     Wx1     CloudCover     77.4600 Actual   61.00000
# 4  2020-04-28 08:00:00     Wx1       DewPoint     50.2600 Actual   49.99000
# 5  2020-04-28 08:00:00     Wx1      WindSpeed      7.4200 Actual    7.24000
# 6  2020-04-28 08:00:00     Wx1 SolarRadiation     12.9300 Actual   23.40000
# 7  2020-04-28 08:00:00     Wx2    Temperature     57.0500 Actual   57.32000
# 8  2020-04-28 08:00:00     Wx2  Precipitation      0.0037 Actual    0.00015
# 9  2020-04-28 08:00:00     Wx2     CloudCover     68.3000 Actual   61.00000
# 10 2020-04-28 08:00:00     Wx2       DewPoint     50.5000 Actual   49.99000
# 11 2020-04-28 08:00:00     Wx2      WindSpeed      7.3200 Actual    7.24000
# 12 2020-04-28 08:00:00     Wx2 SolarRadiation     19.0200 Actual   23.40000
Bas
  • 4,628
  • 1
  • 14
  • 16