7

I have two data frames, the first contains 9994 rows and the second contains 60431 rows. I want to merge the two data frames such that the merged data frame contains combined columns of both data frames but only contains 9994 rows.

However, I get more than 9994 rows upon merge. How can I make sure this does not happen?

df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994

df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431

df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057

df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057

EDIT : Following akrun's comment. Yes, there were duplicates in the second data frame

nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431

How can I take only one row from a data frame if there are multiple for the same {col1,col2} combination. When I merge, I would like to have only 9994 rows.

user438383
  • 5,716
  • 8
  • 28
  • 43
tubby
  • 2,074
  • 3
  • 33
  • 55
  • 2
    Can you check whether there are duplicates for 'col1', and 'col2' in each of the datasets? If that is the case, you may create a sequence index, grouped by 'col1' and 'col2' in each of the dataset and merge by 'col1', 'col2', and the 'indx'. Another option would be using `match`. But, without a small example, it is difficult to test – akrun May 23 '15 at 08:41
  • @akrun, thanks akrun. I have made an edit above. please see. How can I take only row for the col1,col2 combination so as to end up with as many rows as in the first data frame upon merge? – tubby May 23 '15 at 20:25
  • 3
    You may need to create a sequence index as I commented earlier. Without an example data, it is not easy to test. Please show some reproducible example – akrun May 24 '15 at 04:13
  • You will need to create a sequence to index this as akrun has mentioned. – user3919708 Jul 23 '15 at 12:47
  • 1
    I think the `indx` will not solve the problem. I think it is necessary to decide on what to do with repeated cases in `df2`, i.e, take the first, last, average, etc. in order to have just the `df1` with extra columns from `df2`. – Robert Jul 19 '16 at 22:22

1 Answers1

2

This should work, be sure to sort df2 first so you select the right rows.

df = merge(
  df1,
  df2[!duplicated(df2[, c("col1","col2")]), ],
  by=c("col1","col2"),
  all.x=TRUE
)

What happens here: I merge the two data frames by the columns we want to merge by, but I first select only the first occurrence of any combination of col1 and col2 from the second data.frame df2.

duplicated checks if lines are duplicated if called with a data.frame. I select col1 and col2 from df2, so duplicated returns TRUE for rows with the same col1 and col2 but differences in other cols. Then I select only the rows which are not duplicated.

(Read the [-expressions carefully, and check the function calls from the inside out, to get the intermediate results)

edit: added explanation as suggested in comments

snaut
  • 2,261
  • 18
  • 37
  • Can you also please add an explanation of the code? – PM0087 Jun 04 '20 at 09:47
  • When I copy the solution code to R Studio, I get four unexpected / unmatched opening and closing bracket complaints, even though the code appears correct in this respect. Anybody know why? – W Barker Apr 03 '21 at 11:56
  • 1
    The parenthesis in "!duplicated(" appears to be unmatched. What worked for me: df = merge( df1, df2[!duplicated(df2[, c("col1","col2")]), ], by=c("col1","col2"), all.x=TRUE ) – Stephen Nov 02 '22 at 15:06
  • Thank you, I corrected the answer accordingly – snaut Nov 03 '22 at 16:23