18

I'm using dplyrs left join function in order to match two dataframes.

I have a panel data set A which consists of 4708 rows and 2 columns ID and Name:

ID Name
1  Option1
1  Option2
1  Option3
2  Option2
2  Option3
3  Option1
3  Option4

My dataset B consists of single definitions and categories for each name column (86 rows):

Name        Definition  Category
Option1     Def1         1
Option2     Def2         1
Option3     Def2         2
Option4     Def3         2

So in the end I need following data set C which links the columns of B to A:

ID Name      Definition   Category
1  Option1   Def1         1
1  Option2   Def2         1
1  Option3   Def2         2
2  Option2   Def2         1
2  Option3   Def2         2
3  Option1   Def1         1
3  Option4   Def3         2

I used a left_join command in dplyr to do this:

Data C <- left_join(A,B, by="name")

However, for some reason I got 5355 rows instead of the original 4708, so rows were some added. My understanding was that left_join simply assigns the definitions & categories of B to data set A.

Why do I get more rows ? Or are there any other ways to get the desired data frame C?

s_baldur
  • 29,441
  • 4
  • 36
  • 69
Kosta S.
  • 339
  • 1
  • 2
  • 11
  • Probably related [Why does the result from merge have more rows than original file?](https://stackoverflow.com/questions/24150765/why-does-the-result-from-merge-have-more-rows-than-original-file); [Merging data frames without duplicating rows](https://stackoverflow.com/questions/8828870/merging-data-frames-without-duplicating-rows). – Henrik Mar 13 '18 at 13:04
  • 1
    sounds like multiple matching so `B` has multiple entries to `A$name` – Stephan Mar 13 '18 at 13:08

4 Answers4

19

With left_join(A, B) new rows will be added wherever there are multiple rows in B for which the key columns (same-name columns by default) match the same, single row in A. For example:

library(dplyr)
df1 <- data.frame(col1 = LETTERS[1:4],
                  col2 = 1:4)
df2 <- data.frame(col1 = rep(LETTERS[1:2], 2),
                  col3 = 4:1)

left_join(df1, df2)  # has 6 rows rather than 4
Jordi
  • 1,313
  • 8
  • 13
4

It's hard to know without seeing your original data, but if data frame B does not contain unique values on the join columns, you will get repeated rows from data frame A whenever this happens. You could try:

data_frame_b %>% count(join_col_1, join_col_2)

Which will let you know if there are non-unique combinations of the two variables.

paleolimbot
  • 406
  • 2
  • 3
1

More rows may also appear if you have NA values in both A's and B's names on which you join. So make sure you exclude those.

ah bon
  • 9,293
  • 12
  • 65
  • 148
Imitation
  • 104
  • 7
0

I had a similar case. As other answers already mentioned, make sure you have unique values in the columns you're joining:

df_to_join <- unique(df2)
joined_df <- left_join(df1, df_to_join, by="name")
vhio
  • 145
  • 7