merge results in more rows than one of the data frames

Question

I have two data frames, the first contains 9994 rows and the second contains 60431 rows. I want to merge the two data frames such that the merged data frame contains combined columns of both data frames but only contains 9994 rows.

However, I get more than 9994 rows upon merge. How can I make sure this does not happen?

df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994

df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431

df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057

df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057

EDIT : Following akrun's comment. Yes, there were duplicates in the second data frame

nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431

How can I take only one row from a data frame if there are multiple for the same {col1,col2} combination. When I merge, I would like to have only 9994 rows.

Can you check whether there are duplicates for 'col1', and 'col2' in each of the datasets? If that is the case, you may create a sequence index, grouped by 'col1' and 'col2' in each of the dataset and merge by 'col1', 'col2', and the 'indx'. Another option would be using `match`. But, without a small example, it is difficult to test — akrun, May 23 '15 at 08:41
@akrun, thanks akrun. I have made an edit above. please see. How can I take only row for the col1,col2 combination so as to end up with as many rows as in the first data frame upon merge? — tubby, May 23 '15 at 20:25
You may need to create a sequence index as I commented earlier. Without an example data, it is not easy to test. Please show some reproducible example — akrun, May 24 '15 at 04:13
You will need to create a sequence to index this as akrun has mentioned. — user3919708, Jul 23 '15 at 12:47
I think the `indx` will not solve the problem. I think it is necessary to decide on what to do with repeated cases in `df2`, i.e, take the first, last, average, etc. in order to have just the `df1` with extra columns from `df2`. — Robert, Jul 19 '16 at 22:22

snaut · Answer 1 · 2022-11-03T16:23:44.050

2

This should work, be sure to sort df2 first so you select the right rows.

df = merge(
  df1,
  df2[!duplicated(df2[, c("col1","col2")]), ],
  by=c("col1","col2"),
  all.x=TRUE
)

What happens here: I merge the two data frames by the columns we want to merge by, but I first select only the first occurrence of any combination of col1 and col2 from the second data.frame df2.

duplicated checks if lines are duplicated if called with a data.frame. I select col1 and col2 from df2, so duplicated returns TRUE for rows with the same col1 and col2 but differences in other cols. Then I select only the rows which are not duplicated.

(Read the [-expressions carefully, and check the function calls from the inside out, to get the intermediate results)

edit: added explanation as suggested in comments

edited Nov 03 '22 at 16:23

answered Jul 07 '17 at 08:51

snaut

2,261
18
37

Can you also please add an explanation of the code? – PM0087 Jun 04 '20 at 09:47
When I copy the solution code to R Studio, I get four unexpected / unmatched opening and closing bracket complaints, even though the code appears correct in this respect. Anybody know why? – W Barker Apr 03 '21 at 11:56
1

The parenthesis in "!duplicated(" appears to be unmatched. What worked for me: df = merge( df1, df2[!duplicated(df2[, c("col1","col2")]), ], by=c("col1","col2"), all.x=TRUE ) – Stephen Nov 02 '22 at 15:06
Thank you, I corrected the answer accordingly – snaut Nov 03 '22 at 16:23

merge results in more rows than one of the data frames

1 Answers1

Linked