Using dplyr left_join for the same data sets with different variables in by argument

Question

Data

I have the output of a driving simulator experiment. I am sharing some data for 2 different drivers changing lanes. The first data set, foo, is shown below:

Data Set # 1

> foo
# A tibble: 4 x 7
                file.ID   lcf         TL lead_veh_TL foll_veh_TL Start_Frame_CS End_Frame1_CS
                  <chr> <int>      <chr>       <chr>       <chr>          <dbl>         <dbl>
1 Cars_20160601_01.hdf5 43207 right_lane      StarT7        <NA>          42899         43476
2 Cars_20160601_01.hdf5 43207 right_lane        <NA> ditiExpeon6          42899         43476
3 Cars_20160601_02.hdf5 52843  left_lane      BMWC10        <NA>          52498         53211
4 Cars_20160601_02.hdf5 52843  left_lane        <NA>    owT8Yell          52498         53211

where,
* file.ID = Unique ID of a driving scenario
* lcf = Time frame # when the vehicle touched lane marking
* TL = Target Lane (where the vehicle goes at the end of lane change)
* lead_veh_TL = Name of lead vehicle in target lane
* foll_veh_TL = Name of following vehicle in target lane
* Start_Frame_CS = Time frame # when the lane change started in origin lane
* End_Frame1_CS = Time frame # when the lane change ended in the target lane

Here's an illustration for the file.ID == "Cars_20160601_01.hdf5" scenario: [image not included]

Data Set # 2

The second data frame, bar, contains the speed of all vehicles at all times (including the times when a lane change occurred). The first few rows are:

> bar
# A tibble: 205,231 x 5
                 file.ID frames      lane ADO_name speed.kph
                   <chr>  <int>     <chr>    <chr>     <dbl>
 1 Cars_20160601_01.hdf5  35002 left_lane   BMWC10  80.62273
 2 Cars_20160601_01.hdf5  35003 left_lane   BMWC10  80.72590
 3 Cars_20160601_01.hdf5  35004 left_lane   BMWC10  80.83455
 4 Cars_20160601_01.hdf5  35005 left_lane   BMWC10  80.94342
 5 Cars_20160601_01.hdf5  35006 left_lane   BMWC10  81.05671
 6 Cars_20160601_01.hdf5  35007 left_lane   BMWC10  81.17065
 7 Cars_20160601_01.hdf5  35008 left_lane   BMWC10  81.28705
 8 Cars_20160601_01.hdf5  35009 left_lane   BMWC10  81.40385
 9 Cars_20160601_01.hdf5  35010 left_lane   BMWC10  81.52023
10 Cars_20160601_01.hdf5  35011 left_lane   BMWC10  81.63548
# ... with 205,221 more rows

where,
* frames = Time frame #
* lane = current lane
* ADO_name = name of the vehicle (it includes both lead and following vehicles in target lane)
* speed.kph = speed of the vehicle in the current time frame

The bar data set is too large to reproduce in full here because it contains both lane-change and non-lane-change time frames, and both are required for this question. So, I have uploaded bar to Google Drive. You can download it here: https://drive.google.com/open?id=0ByvW4Hq_6a56dnIxYWh6M2ZRTUE (csv file)

Code to load csv file `bar`:

library(tibble)
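# stringsAsFactors = FALSE keeps text columns as <chr>, matching the printout above
bar <- as_tibble(read.csv("bar.csv", header = TRUE, stringsAsFactors = FALSE))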

What I want to do

I want to use the bar and foo data sets to:
1. Extract speeds of lead and following vehicles at START FRAME of LANE CHANGE (Start_Frame_CS)
2. Extract speeds of lead and following vehicles at LANE CHANGE FRAME (lcf)
3. Extract speeds of lead and following vehicles at END FRAME of LANE CHANGE (End_Frame1_CS)
4. Extract mean speed of lead and following vehicles during lane change, i.e. the mean of ALL speeds from Start_Frame_CS to End_Frame1_CS, inclusive

What I have tried

I can do this manually by using dplyr::left_join multiple times. The following is how I extract the speeds of lead_veh_TL at lcf & Start_Frame_CS:

Lead Veh Speed at Lane Change Frame

library(dplyr)
lead_veh_TL_lcf <- foo %>% 
  select(-ends_with("CS"), -foll_veh_TL) %>% 
  left_join(x=., y = bar,
            by = c("file.ID"="file.ID","lcf"="frames", 
                   "TL" = "lane", "lead_veh_TL" = "ADO_name") )%>% 
  filter(!is.na(lead_veh_TL)) %>% 
  rename(speed.kph_LV_TL_lcf = speed.kph)  

> lead_veh_TL_lcf
# A tibble: 2 x 5
                file.ID   lcf         TL lead_veh_TL speed.kph_LV_TL_lcf
                  <chr> <int>      <chr>       <chr>               <dbl>
1 Cars_20160601_01.hdf5 43207 right_lane      StarT7            79.54961
2 Cars_20160601_02.hdf5 52843  left_lane      BMWC10           103.71717

Lead Veh Speed at Start Frame

lead_veh_TL_SF <- foo %>% 
  select(-lcf, -foll_veh_TL, -End_Frame1_CS) %>% 
  left_join(x=., y = bar,
            by = c("file.ID"="file.ID","Start_Frame_CS"="frames", 
                   "TL" = "lane", "lead_veh_TL" = "ADO_name") )%>% 
  filter(!is.na(lead_veh_TL)) %>% 
  rename(speed.kph_LV_TL_SF = speed.kph)  

> lead_veh_TL_SF
# A tibble: 2 x 5
                file.ID         TL lead_veh_TL Start_Frame_CS speed.kph_LV_TL_SF
                  <chr>      <chr>       <chr>          <dbl>              <dbl>
1 Cars_20160601_01.hdf5 right_lane      StarT7          42899           79.54841
2 Cars_20160601_02.hdf5  left_lane      BMWC10          52498          102.87223

Mean Speed of Lead Vehicle

foo_mean_LV <- bar %>%
  left_join(x =., y = foo %>% select(-lcf,  -foll_veh_TL), 
            by = c("file.ID" = "file.ID")) %>% 
  group_by(file.ID) %>% 
  filter(frames>=Start_Frame_CS & frames<=End_Frame1_CS, ADO_name==lead_veh_TL) %>% 
  ungroup() %>% 
  group_by(file.ID, lead_veh_TL) %>% 
  summarize(Start_Frame_CS = unique(Start_Frame_CS),
            End_Frame1_CS = unique(End_Frame1_CS),
            mean_sp_LV = mean(speed.kph),
            sd_sp_LV = sd(speed.kph)) %>% 
  ungroup()

> foo_mean_LV
# A tibble: 2 x 6
                file.ID lead_veh_TL Start_Frame_CS End_Frame1_CS mean_sp_LV    sd_sp_LV
                  <chr>       <chr>          <dbl>         <dbl>      <dbl>       <dbl>
1 Cars_20160601_01.hdf5      StarT7          42899         43476   79.54532 0.006486832
2 Cars_20160601_02.hdf5      BMWC10          52498         53211  100.94923 1.608811109

For the following vehicle, I can simply replace lead_veh_TL in the above code with foll_veh_TL (and drop lead_veh_TL instead of foll_veh_TL in the select step).

Problem

As you can see, writing code repeatedly in this manner is tedious and error-prone. I want a function to which I could just provide the time frame and the type of vehicle (lead/following), with everything else remaining the same. However, I can't seem to find a way to write such a function. I only found one related answer here, but it doesn't solve my problem.

Please guide me on how to write an efficient function to get the desired results. My original data set has many more variables in addition to speed.kph.

From what I understand from your question, you want to program with **dplyr**. You should read the new vignette about the new tidyeval way of using **dplyr** (http://dplyr.tidyverse.org/articles/programming.html). — F. Privé, Jul 24 '17 at 12:41
Against my better judgement I downloaded and attempted to load the bar.rds file. It resulted in an error message. I suggest you instead post a text file and the code to load and transform it to its desired state. — IRTFM, Jul 24 '17 at 16:05
@42- Thanks for pointing out the error. I have uploaded the file as a csv and included the code to load it. — umair durrani, Jul 24 '17 at 17:13
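Following up on the programming-with-dplyr suggestion in the comments, here is a minimal sketch of how the repeated joins might be wrapped in functions. The names speed_at_frame and mean_speed_during, and the output column names they produce, are hypothetical and not from the question. The sketch assumes dplyr >= 1.0 (for all_of, across and .groups) and that foo and bar have the columns shown above. The mean-speed helper joins on file.ID and vehicle name rather than on file.ID alone, which should give the same rows as the manual version with a smaller intermediate table.

```r
library(dplyr)

# Hypothetical helper: speed of one vehicle type at one frame column of foo.
#   frame_col: "lcf", "Start_Frame_CS" or "End_Frame1_CS"
#   veh_col:   "lead_veh_TL" or "foll_veh_TL"
speed_at_frame <- function(foo, bar, frame_col, veh_col) {
  # Join spec: names are columns of foo, values are columns of bar
  by <- setNames(c("file.ID", "frames", "lane", "ADO_name"),
                 c("file.ID", frame_col, "TL", veh_col))
  out <- foo %>%
    filter(!is.na(.data[[veh_col]])) %>%
    select(all_of(c("file.ID", frame_col, "TL", veh_col))) %>%
    left_join(bar, by = by)
  # Label the speed column with the vehicle type and frame it refers to
  names(out)[names(out) == "speed.kph"] <-
    paste("speed.kph", veh_col, frame_col, sep = "_")
  out
}

# Hypothetical helper: mean and sd of speed between Start_Frame_CS and
# End_Frame1_CS (inclusive) for one vehicle type.
mean_speed_during <- function(foo, bar, veh_col) {
  keys <- c("file.ID", veh_col, "Start_Frame_CS", "End_Frame1_CS")
  foo %>%
    filter(!is.na(.data[[veh_col]])) %>%
    select(all_of(keys)) %>%
    inner_join(bar, by = setNames(c("file.ID", "ADO_name"),
                                  c("file.ID", veh_col))) %>%
    filter(frames >= Start_Frame_CS, frames <= End_Frame1_CS) %>%
    group_by(across(all_of(keys))) %>%
    summarise(mean_sp = mean(speed.kph), sd_sp = sd(speed.kph),
              .groups = "drop")
}
```

With these definitions, speed_at_frame(foo, bar, "lcf", "lead_veh_TL") would stand in for the first manual join, and mean_speed_during(foo, bar, "foll_veh_TL") would cover the following vehicle. Other variables from bar come along in the join, so the same pattern should extend beyond speed.kph.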