Select only the first row when merging data frames with multiple matches

Question

I have two data frames, "data" and "scores", and want to merge them on the "id" column:

data = data.frame(id = c(1,2,3,4,5),
                  state = c("KS","MN","AL","FL","CA"))
scores = data.frame(id = c(1,1,1,2,2,3,3,3),
                    score = c(66,75,78,86,85,76,75,90))
merge(data, scores, by = "id")                  
semi_join(data, scores, by = "id")

In the "scores" data, there are "id" with multiple observations, where each match gets a row following the join. See ?merge:

If there is more than one match, all possible matches contribute one row each.

However, I want keep only the row corresponding to the first match from the scores table.

A semi join would have been nice, but I'm not able to select the score from the right table.

Any suggestions?

score 25 · Accepted Answer · edited Aug 23 '17 at 18:47

25

Using data.table along with mult = "first" and nomatch = 0L:

require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference

scores[data, mult = "first", on = "id", nomatch=0L]
#    id score state
# 1:  1    66    KS
# 2:  2    86    MN
# 3:  3    76    AL

For each row on data's id column, the matching rows in scores' id column are found, and the first one alone is retained (because mult = "first"). If there are no matches, they're removed (because of nomatch = 0L).

edited Aug 23 '17 at 18:47

Henrik

65,555
14
143
159

answered Jun 10 '16 at 13:29

Arun

116,683
26
284
387

How about using a keyed `scores` data.table: `setDT(data); scores <- data.table(scores, key = "id"); unique(scores)[data, nomatch = 0L]` – mtoto Jun 10 '16 at 13:34
4

It'd work fine, but is just very inefficient. `data.table()` results in a deep copy. The key argument would reorder the entire data.table (inefficient unless case can be made for reuse and probably undesirable). `unique()` results in unnecessary intermediate data. And `on=` also allows for a clean syntax to just look at it and understand what the join column is (although it can be used on keyed data.tables as well). I recommend reading the [secondary indices](https://github.com/Rdatatable/data.table/wiki/Getting-started) vignette. – Arun Jun 10 '16 at 13:39
Hi @Arun is it possible to tell `mult` to take the highest value according to another column instead of the first ? – Orhan Yazar Jul 29 '19 at 08:20
@OrhanYazar not with rolling joins, but possibly with non-equi joins. Post a Q and tag it here maybe? I (or other so many able people here on SO) could take a look. – Arun Sep 26 '19 at 18:05
The mult = "first" option doesn't work if you change the order of the datatables data[scores, mult = "first", on = "id", nomatch=0L] – skan May 07 '21 at 22:31

lmo · Answer 2 · 2017-04-07T14:35:08.480

10

Here is a base R method using aggregate and head:

merge(data, aggregate(score ~ id, data=scores, head, 1), by="id")

The aggregate function breaks up the scores dataframe by id, then head is applied to get the first observation from each id. Since aggregate returns a data.frame, this is directly merged onto the data.frame data.

Probably more efficient is to subset the scores data.frame using duplicated which will achieve the same result as aggregate, but will reduce the computational overhead.

merge(data, scores[!duplicated(scores$id),], by="id")

edited Apr 07 '17 at 14:35

answered Jun 10 '16 at 13:26

lmo

37,904
9
56
69

This solution is good, but it is inefficient. I would use dplyr::distinct to replace aggregate. – Huanfa Chen Apr 07 '17 at 13:10
Thanks @huanfa-chen. Your suggestion gave me the idea to use `duplicated` in place of `aggregate + head` which will certainly be more efficient. – lmo Apr 07 '17 at 14:37

score 9 · Answer 3 · answered Apr 07 '17 at 13:16

Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.

data = data.frame(id=c(1,2,3,4,5),
                  state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
                    score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL    NA
# 5  5    CA    NA

Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:

data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1  1    KS    66
# 2  2    MN    86
# 3  3    AL    76
# 4  4    FL     0
# 5  5    CA     0

score 1 · Answer 4 · answered May 29 '19 at 08:59

In base you can use match to Select only the first row when merging data frames with multiple matches.

#Return also those which found no match
(tt <- cbind(data, score=scores[match(data$id, scores$id),"score"]))
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76
#4  4    FL    NA
#5  5    CA    NA

#Return only those which found a match
tt[!is.na(tt$score),]
#  id state score
#1  1    KS    66
#2  2    MN    86
#3  3    AL    76

Select only the first row when merging data frames with multiple matches

4 Answers4

Linked

Related