Left Join in R (dplyr) - Too many observations?

Question

I'm using dplyr's left_join function to match two data frames.

I have a panel data set A which consists of 4708 rows and two columns, ID and Name:

ID Name
1  Option1
1  Option2
1  Option3
2  Option2
2  Option3
3  Option1
3  Option4

My data set B holds a single definition and category for each name (86 rows):

Name        Definition  Category
Option1     Def1         1
Option2     Def2         1
Option3     Def2         2
Option4     Def3         2

So in the end I need the following data set C, which links the columns of B to A:

ID Name      Definition   Category
1  Option1   Def1         1
1  Option2   Def2         1
1  Option3   Def2         2
2  Option2   Def2         1
2  Option3   Def2         2
3  Option1   Def1         1
3  Option4   Def3         2

I used a left_join command in dplyr to do this:

C <- left_join(A, B, by = "Name")

However, for some reason I got 5355 rows instead of the original 4708, so some rows were added. My understanding was that left_join simply assigns the definitions and categories of B to data set A.

Why do I get more rows? Or is there another way to get the desired data frame C?

Probably related [Why does the result from merge have more rows than original file?](https://stackoverflow.com/questions/24150765/why-does-the-result-from-merge-have-more-rows-than-original-file); [Merging data frames without duplicating rows](https://stackoverflow.com/questions/8828870/merging-data-frames-without-duplicating-rows). — Henrik, Mar 13 '18 at 13:04
Sounds like multiple matching: `B` has multiple entries for some values of `A$name`. — Stephan, Mar 13 '18 at 13:08

score 19 · Accepted Answer · answered Mar 13 '18 at 13:09

With left_join(A, B) new rows will be added wherever there are multiple rows in B for which the key columns (same-name columns by default) match the same, single row in A. For example:

library(dplyr)
df1 <- data.frame(col1 = LETTERS[1:4],
                  col2 = 1:4)
df2 <- data.frame(col1 = rep(LETTERS[1:2], 2),
                  col3 = 4:1)

left_join(df1, df2)  # has 6 rows rather than 4
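
The same effect can be reproduced with data shaped like the question's A and B. This is only a sketch: the second "Option2" row in B (with the made-up definition "Def2b") is an assumption added to trigger the duplication, since the real 86-row B is not shown.

```r
library(dplyr)

# Toy copy of A from the question
A <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3),
  Name = c("Option1", "Option2", "Option3", "Option2",
           "Option3", "Option1", "Option4")
)

# Toy copy of B, plus one ASSUMED duplicate key ("Option2" appears twice)
B <- data.frame(
  Name       = c("Option1", "Option2", "Option2", "Option3", "Option4"),
  Definition = c("Def1", "Def2", "Def2b", "Def2", "Def3"),
  Category   = c(1, 1, 1, 2, 2)
)

C <- left_join(A, B, by = "Name")

nrow(A)  # 7
nrow(C)  # 9: each of the two "Option2" rows in A matches two rows in B
```

Every row of A whose key appears k times in B contributes k rows to the result, which is where the surplus comes from.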

answered Mar 13 '18 at 13:09

Jordi


What is the solution for this? You haven't mentioned that. – user20203146 Dec 14 '22 at 18:07
Please provide a solution for the same. – Mr Pool Dec 14 '22 at 18:22
The solution is to eliminate duplicate keys before you do the join. – Brandon Apr 14 '23 at 20:27

score 4 · Answer 2 · answered Mar 13 '18 at 13:26

It's hard to know without seeing your original data, but if data frame B does not contain unique values on the join columns, you will get repeated rows from data frame A whenever this happens. You could try:

data_frame_b %>% count(join_col_1, join_col_2)

This shows how often each combination of the join columns occurs. Any combination with n greater than 1 is not unique and will duplicate the matching rows from data frame A.
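
As a minimal sketch of that check for a single join column, the following lists only the keys that occur more than once. The table B here is a hypothetical stand-in with one repeated name, not the asker's data.

```r
library(dplyr)

# Hypothetical lookup table with one repeated key ("Option2")
B <- data.frame(
  Name       = c("Option1", "Option2", "Option2", "Option3"),
  Definition = c("Def1", "Def2", "Def2b", "Def2")
)

# Count rows per key and keep only keys that are not unique
dup_keys <- B %>%
  count(Name) %>%
  filter(n > 1)

dup_keys  # one row: Name "Option2" with n = 2
```

If dup_keys comes back empty, duplicate keys in B are not the cause of the extra rows.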

score 1 · Answer 3 · edited Oct 16 '21 at 08:07

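More rows may also appear if the column you join on contains NA values in both A and B. By default dplyr treats NA as matching NA, so each NA key in A is matched to every NA key in B. Exclude those rows before joining, or pass na_matches = "never" to left_join.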
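
A small sketch of the NA case, using made-up data frames: the NA key in A matches both NA keys in B under the default setting, while na_matches = "never" leaves that row unmatched.

```r
library(dplyr)

# Hypothetical data: one NA key in A, two NA keys in B
A <- data.frame(ID = 1:3, Name = c("Option1", NA, "Option2"))
B <- data.frame(
  Name       = c("Option1", NA, NA),
  Definition = c("Def1", "DefX", "DefY")
)

# Default behaviour: NA matches NA, so the NA row of A is returned twice
with_na <- left_join(A, B, by = "Name")

# NA keys never match; the NA row of A is kept once with a missing Definition
no_na <- left_join(A, B, by = "Name", na_matches = "never")

nrow(with_na)  # 4
nrow(no_na)    # 3
```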

edited Oct 16 '21 at 08:07

ah bon


answered Mar 02 '20 at 12:41

Imitation


score 0 · Answer 4 · answered Nov 03 '21 at 21:17

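I had a similar case. As other answers already mentioned, make sure the values in the columns you're joining on are unique in the data frame you join in. Note that unique() removes only rows that are identical in every column, so this works when the duplicates are exact copies:

# drop exact duplicate rows from the lookup table before joining
df_to_join <- unique(df2)
joined_df <- left_join(df1, df_to_join, by = "name")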
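
When rows share a key but differ in other columns, one option is dplyr's distinct() with .keep_all = TRUE, which keeps the first row for each key. The sketch below uses invented df1 and df2; which of the conflicting rows should survive depends on your data, so keeping the first is an assumption.

```r
library(dplyr)

# Hypothetical data: "a" appears twice in df2 with different y values
df1 <- data.frame(name = c("a", "b", "c"), x = 1:3)
df2 <- data.frame(name = c("a", "a", "b"), y = c(10, 11, 20))

nrow(unique(df2))  # 3: the two "a" rows differ in y, so nothing is dropped

# Keep the first row for each name (ASSUMPTION: the first row is the one wanted)
df_to_join <- distinct(df2, name, .keep_all = TRUE)

joined_df <- left_join(df1, df_to_join, by = "name")
nrow(joined_df)  # 3, same as nrow(df1)
```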

answered Nov 03 '21 at 21:17

vhio
