0

My question is similar to this r count combinations of elements in groups however, firstly, I want to group all potential combinations by group in a column Comb and second, count the occurrences of the combinations depending on year in a column n.

Using the same mock dataset:

> dat = data.table(group = c(1,1,1,2,2,2,3,3), id=c(10,11,12,10,11,13,11,13))
> dat
   group id year
1:     1 10 2010
2:     1 11 2010
3:     1 12 2010 
4:     2 10 2011 
5:     2 11 2011 
6:     2 13 2011
7:     3 11 2012
8:     3 13 2012

The desired outcome:

   > dat
           group Comb   year  n
        1:     1 10 11  2010  1
        2:     1 11 12  2010  1
        3:     1 12 10  2010  1
        4:     2 10 11  2011  2
        5:     2 11 13  2011  1
        6:     2 13 10  2011  1
        7:     3 11 13  2012  2

I would much appreciate a possible solution with dplyr.

thanks

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
Amleto
  • 584
  • 1
  • 7
  • 25
  • 3
    I do not understand the `n` column in your output. Nor your combinations. In your output, row 4 has combination `10 11`, and row 5 has combination `11 10`. Why does order matter here, but not before? Why is there `11 10` for group 2, but not for group 1? And why are those `n` values `2`? – Gregor Thomas Mar 27 '19 at 20:01
  • may I ask why did you use data.table and not tibble if you are looking for a dplyr solution? maybe it will be useful to add the correct tag depending on your requirements – chinsoon12 Mar 28 '19 at 02:59
  • I have changed the desired outcome based on @Gregor comment. The order does not matter, what matter is to get all potential combinations of **id** for each **group**. Then, numbering the occurrence based n **year** and **group**. I hope this helps. – Amleto Mar 29 '19 at 09:17
  • I still do not understand `n`. If we look at your input, in `year = 2010` and `group = 1`, there is are 3 ids, all unique, `10`, `11`, and `12`. The desired outcome for `group = 1, year = 2010` has 3 rows, one for each `Comb`, and the `n` is one for all of them. This makes sense to me. However, the input of `group = 2` and `year = 2011` looks the same: 3 rows, 3 unique ids. But in the output, the `Combs = 10 11` has `n = 2`. Why is `n = 2` in output row 4, but `n = 1` for output row 1??? – Gregor Thomas Mar 29 '19 at 12:02
  • Because it is the second appearance of the combination `10 11`. similar in `row 7` comb `11 13` `n=2` cause the same combination appeared in `row5`. – Amleto Mar 29 '19 at 12:52

1 Answers1

1

Here's a solution, presented first as data.table then as dplyr. The process is the same: we self-join on group, filter where the id combinations are in a consistent order (any order would work, we pick first id < second id), group by combination to number the rows, and drop the unused columns.

dat = data.table(group = c(1,1,1,2,2,2,3,3), id=c(10,11,12,10,11,13,11,13))

## with data.table
merge(dat, dat, by = "group", allow.cartesian = TRUE)[
  id.x < id.y, ][
    , Comb := paste(id.x, id.y)][
      , n := 1:.N, by = .(Comb)
    ][, .(group, Comb, n)]
#    group  Comb n
# 1:     1 10 11 1
# 2:     1 10 12 1
# 3:     1 11 12 1
# 4:     2 10 11 2
# 5:     2 10 13 1
# 6:     2 11 13 1
# 7:     3 11 13 2

## with dplyr
dat %>% full_join(dat, by = "group") %>%
  filter(id.x < id.y) %>%
  group_by(Comb = paste(id.x, id.y)) %>%
  mutate(n = row_number()) %>%
  select(group, Comb, n)
# # A tibble: 7 x 3
# # Groups:   Comb [5]
#   group Comb      n
#   <dbl> <chr> <int>
# 1     1 10 11     1
# 2     1 10 12     1
# 3     1 11 12     1
# 4     2 10 11     2
# 5     2 10 13     1
# 6     2 11 13     1
# 7     3 11 13     2
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294