Merge two dataframes based on common two columns & if values in third column are nearest to each other

Question

I am stuck in merging two data-sets the simple complexity of which goes well-beyond my fluency in R. I tried to learn from here and here but could not solve my problem. I am trying to merge the following two data frames:

df1

No    County       Route      Number
1     Anderson       SR009       6150
2     Anderson       SR061       5880
3     Bedford        SR016       9500
4     Bedford        SR130       320
5       .
6       .
7       .
8       .

df2

No.  County        Route     Number1    abc      def
1    Clay          02264     4500        50       789
2     Dickson       01544     5870       45       33
3     Anderson      01421     981        70       65
4     Anderson      SR009     10000      56       56
5     Anderson      SR009     6145       32       53
6     Bedford       SR016     7500       23       32
7     Anderson      SR061     4400       12       24
8     Anderson      SR061     5875       87       26
9     Anderson      SR061     15000      45       45
10     Bedford       SR016     22000     71       75
11     Bedford       SR016     9450      145      615
12     Bedford       SR130     900       7854     76
13     Bedford       SR130     310       124      25
14     Anderson      SR061     5865      312      123
       .
       .
       .

First, the "county" and "Route" columns in df1 and df2 should be compared, and if they exactly match, then specific row of df2$Number1 should be selected whose value is NEAREST to df1$Number, and as such all unique df2 columns should be added to df1

Here is the pseudo-code of what i am trying to achieve:

if(df1$County == Anderson & df2$County == Anderson) && if(df1$Route == SR009 & df2$Route == SR009) 
then select specific row from df2$Number1 whose value is nearest to the df1$Number value, 
and add all subsequent columns of df2 to corresponding row in df1

An example:

Based on "County" and "Route" columns, row No. 1 in df1 matches with row 4 and 5 in df2. Now out of the two df2 rows that match with first row in df1, i want to select that specific row in df2 whose "Number1" value is closest to the "Number" value in df1 i.e. 6150. Said this, i want to select row 5 in df2 because "Number1" value is 6145 which is closest to 6150, and add all subsequent columns from df2 to df1...

The final output would look like this:

No      County           Route       Number     Number1     abc    def  .  .    
1       Anderson         SR009       6150       6145        32     53   .  .
2       Anderson         SR061       5880       5875        87     26   .  .
3       Bedford          SR016       9500       9450        145    615  .  .
4       Bedford          SR139       320        310         124    25   .  .
.          .
.          .

I highly appreciate any help on this. Sorry for the long post.

Jacob H · Answer 1 · 2016-10-15T00:34:23.740

0

Your question is a bit confusing. That withstanding, from your desired output I think the following dplyr approach will work for you.

library(dplyr)

d1%>%
  full_join(d2, by = c("County", "Route")) %>%
  group_by(County, Route) %>%
  mutate(myDiff = abs(Number - Number1)) %>%
  slice(which.min(myDiff))

edited Oct 15 '16 at 00:34

answered Oct 15 '16 at 00:27

Jacob H

4,317
2
32
39

Thank you for your response. – B.W Niazi Oct 15 '16 at 00:58
Warning messages: 1: In full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) : joining factors with different levels, coercing to character vector 2: In full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) : joining factors with different levels, coercing to character vector – B.W Niazi Oct 15 '16 at 00:59
1

@B.WNiazi Me too. However, this is not an error but a warning which can be ignored in this context. In plain English the warning is saying that some of the variables in your data are factors. However, the levels of these factors are not consistent across data frames. Therefore `R` is going to convert the factors into characters before the merge/join. – Jacob H Oct 15 '16 at 01:37
I appreciate your guidance. It returns an empty dataframe with 0 observations – B.W Niazi Oct 15 '16 at 01:50

dww · Answer 2 · 2016-10-15T01:45:21.790

Using library(data.table)

setkey(dt1, County, Route)
setkey(dt2, County, Route)
dt3 = dt1[dt2]
dt3[, Number.close := Number1[which.min(abs(Number1-Number))], by = .(County, Route)]
dt3 = dt3[Number.close == Number1, ][, Number.close:=NULL][]

#    No   County Route Number No. Number1 abc def
# 1:  1 Anderson SR009   6150   5    6145  32  53
# 2:  2 Anderson SR061   5880   8    5875  87  26
# 3:  3  Bedford SR016   9500  11    9450 145 615
# 4:  4  Bedford SR130    320  13     310 124  25

The data:

dt1 = structure(list(No = 1:4, County = c("Anderson", "Anderson", "Bedford", 
"Bedford"), Route = c("SR009", "SR061", "SR016", "SR130"), Number = c(6150L, 
5880L, 9500L, 320L)), .Names = c("No", "County", "Route", "Number"
), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County", 
"Route"))

dt2 = structure(list(No. = c(3L, 4L, 5L, 7L, 8L, 9L, 14L, 6L, 10L, 
11L, 12L, 13L, 1L, 2L), County = c("Anderson", "Anderson", "Anderson", 
"Anderson", "Anderson", "Anderson", "Anderson", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "Clay", "Dickson"), Route = c("01421", 
"SR009", "SR009", "SR061", "SR061", "SR061", "SR061", "SR016", 
"SR016", "SR016", "SR130", "SR130", "02264", "01544"), Number1 = c(981L, 
10000L, 6145L, 4400L, 5875L, 15000L, 5865L, 7500L, 22000L, 9450L, 
900L, 310L, 4500L, 5870L), abc = c(70L, 56L, 32L, 12L, 87L, 45L, 
312L, 23L, 71L, 145L, 7854L, 124L, 50L, 45L), def = c(65L, 56L, 
53L, 24L, 26L, 45L, 123L, 32L, 75L, 615L, 76L, 25L, 789L, 33L
)), .Names = c("No.", "County", "Route", "Number1", "abc", "def"
), row.names = c(NA, -14L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County", 
"Route"))

Probably am missing an important point. I get an empty dt3 frame after running the below code... setkey(mydataa, County, Route) setkey(mydata11, County, Route) dt3 = mydataa[mydata11] dt3[, AADT.close := AADT[which.min(abs(AADT2015-AADT))], by = .(County, Route)] dt3 = dt3[AADT.close == AADT2015, ][, AADT.close:=NULL] — B.W Niazi, Oct 15 '16 at 01:36
Are you sure? try `print(dt3)` - see http://stackoverflow.com/questions/32988099/data-table-objects-not-printed-after-returned-from-function - sometimes you have to print twice to see the output after using := in data.table. It is a side effect of another bug fix — dww, Oct 15 '16 at 01:39
I just added an extra `[]` at the end of the final line, to avoid this issue. — dww, Oct 15 '16 at 01:45
I appreciate your efforts. With adding [], now it returns a data frame of 12000 rows (size of df2) while it should return a data table of 336 rows (size of df1), and also all values in dt3 are "NA". — B.W Niazi, Oct 15 '16 at 01:51
Well it seems fine on your minimal example, I guess it must be something to do with a difference between the data you are running it on and your minimal example. Tricky to diagnose without seeing exactly what you are doing — dww, Oct 15 '16 at 02:01

Merge two dataframes based on common two columns & if values in third column are nearest to each other

2 Answers2