join in data.table keeping column ids

Question

I want to keep the column (="person" in the dataframe) after performing a join in a data.table. I was able to get something close to the desired output but it required switching between data.table and dplyr, because of my limited experience with data.table :

Here the dataframe :

df<-structure(list(person = c("p1", "p1", "p1", "p1", "p1", "p1", 
"p1", "p2", "p2", "p2", "p3", "p3", "p3", "p4", "p4", "p4", "p5", 
"p5", "p5", "p6", "p6", "p6", "p7", "p7", "p7"), hp_char = c("hp1", 
"hp2", "hp3", "hp4", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10", 
"hp1", "hp2", "hp3", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10", 
"hp3", "hp4", "hp5", "hp1", "hp2", "hp3")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("person", 
"hp_char"), spec = structure(list(cols = structure(list(person = structure(list(), class = c("collector_character", 
"collector")), hp_char = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("person", "hp_char")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

I am doing a self-join to get the number of instances of co-occurence of any two "hp_id" as follows (similar to what is elaborated in this question). I am keeping the "person" in by=.(...) to see who were involved the cooccurence combinations (e.g. hp1 and hp2 co-ocurred in individuals p1,p3 and p7) :

df_by2<- setDT(df)[df, on = "person", allow = TRUE][
    hp_char < i.hp_char, .N, by = .(person ,HP_ID1 = hp_char, HP_ID2 = i.hp_char)]

However because of including "person" in by =.(person,..., the count (= N) is separated according to combination of "person","hp_id" and "hp_id2". So I switched to dplyr to get close to what I want as follows.

dfx<- df_by2 %>% group_by(HP_ID1,HP_ID2) %>% mutate (counts=length(person)) %>% spread(person,person) %>% select (-N) %>% unique() %>% filter(counts>1) %>% unite(person,p1:p7, sep="") %>% mutate (involved_id=gsub('?NA', ' ', person)) %>% select (-person)

This is the output I get:

# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts   involved_id
    <chr>  <chr>  <int>      <chr>
 1    hp1    hp2      3 p1 p3   p7
 2    hp1    hp3      3 p1 p3   p7
 3   hp10    hp8      2   p2  p5  
 4   hp10    hp9      2   p2  p5  
 5    hp2    hp3      3 p1 p3   p7
 6    hp3    hp4      2  p1    p6 
 7    hp3    hp5      2  p1    p6 
 8    hp4    hp5      2  p1    p6 
 9    hp5    hp6      2  p1  p4   
10    hp5    hp7      2  p1  p4   
11    hp6    hp7      2  p1  p4   
12    hp8    hp9      2   p2  p5

This is close but the desired output (with properly formatted albeit untidy "involved_id" column) is:

# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts   involved_id
    <chr>  <chr>  <int>      <chr>
 1    hp1    hp2      3 p1, p3, p7
 2    hp1    hp3      3 p1, p3, p7
 3   hp10    hp8      2     p2, p5
 4   hp10    hp9      2     p2, p5
 5    hp2    hp3      3 p1, p3, p7
 6    hp3    hp4      2     p1, p6
 7    hp3    hp5      2     p1, p6
 8    hp4    hp5      2     p1, p6
 9    hp5    hp6      2     p1, p4
10    hp5    hp7      2     p1, p4
11    hp6    hp7      2     p1, p4
12    hp8    hp9      2     p2, p5

All of this is very cumbersome and I was wondering if there is a simpler approach to this. I have just recently came across data.table and enjoying learning it. Any help using data.table is highly appreciated.

Is you comparison `hp_char < i.hp_char` working as you expect, given `"hp10" < "hp9" == TRUE` ? — SymbolixAU, Aug 27 '18 at 00:44
Hi. Please make questions self-contained, include whatever you need from the link, with credit. Especially don't make us go to a *third* question. — philipxy, Aug 27 '18 at 01:10
@SymbolixAU Sorry I missed what you meant. Can you please clarify a bit?Looks fine to me as far as the output is concerned with/ without "person" removed from .by(..). — thisisrg, Aug 27 '18 at 17:44
@SymbolixAU those are discrete characters. If I understood you corectly, there is no relation between any of the hp_chars'. — thisisrg, Aug 27 '18 at 21:07
but aren't you doing exactly that, with the `hp_char < i.hp_char` comparison? — SymbolixAU, Aug 27 '18 at 21:45
Hi. You still haven't addressed my comments. You don't explain "co-occurrence" or "co-occurrence combination" (or "separated according to") or even what result you want. Use enough sentences to say what you mean. Read what you wrote--it does not explain. I can now guess from your "e.g." what you might mean by saying two given values in a given column co-occur for a given value in a given other column. But you don't say what you mean by "co-occur" or "co-occurrence" or "co-occurrence combination". Forcing yourself to be clear not only allows communication but allows you to reason clearly. — philipxy, Aug 27 '18 at 22:02
What do you mean by "those are discrete characters" & "there is no relation between any of the hp_chars"? Also: Please read & act on [mcve]. Part of doing that includes justifying "minimal code with a problem for required functionality" by finding minimal code with no problem for maximal partial functionality. — philipxy, Aug 27 '18 at 22:18
@SymbolixAU I see what you are asking. My understanding is that hp_char < i.hp_char is a way to get the get the rows of the dataframe where hp_char ! = i.hp.char. Am I missing something? — thisisrg, Aug 27 '18 at 23:00
I think it's incorrect to do this to compare the strings. For example, you'll miss `"hp9" < "hp10"` — SymbolixAU, Aug 27 '18 at 23:13
Thanks! That is correct. However what I wanted was e.g. once I got HP_ID1="hp10" and HP_ID2= "hp9" , I do not need the reverse HP_ID1= "hp9" and HP_ID2="hp10". They would mean the same thing for the purposes of the problem. I agree that in general hp_char != i.hp_char may be better . It will not miss the instances you mention. — thisisrg, Aug 27 '18 at 23:32

chinsoon12 · Accepted Answer · 2018-08-28T00:17:26.513

Continuing from the answer posted earlier here (also copied here for convenience), use .(.N, involved_id=paste(x.person, collapse=", ")) for the final desired output instead:

library(data.table)
setDT(df)

nset <- 3
cols <- paste0("hp_char", seq_len(nset))

#create combinations of nset number of skills
combi <- do.call(CJ, rep(df[,.(unique(hp_char))], nset))
setnames(combi, cols)

#create for each person the combinations of nset number of skills
nsetSkills <- df[, do.call(CJ, rep(.(hp_char), nset)), by=.(person)]
setnames(nsetSkills, names(nsetSkills)[-1L], cols)

ans <- nsetSkills[combi, on=cols, 
    .(.N, involved_id=paste(x.person, collapse=", ")), by=.EACHI]
ans

output:

      hp_char1 hp_char2 hp_char3 N involved_id
   1:      hp1      hp1      hp1 3  p1, p3, p7
   2:      hp1      hp1     hp10 0          NA
   3:      hp1      hp1      hp2 3  p1, p3, p7
   4:      hp1      hp1      hp3 3  p1, p3, p7
   5:      hp1      hp1      hp4 1          p1
  ---                                         
 996:      hp9      hp9      hp5 0          NA
 997:      hp9      hp9      hp6 0          NA
 998:      hp9      hp9      hp7 0          NA
 999:      hp9      hp9      hp8 2      p2, p5
1000:      hp9      hp9      hp9 2      p2, p5

Posts should be self-contained. Please edit this to be a self-contained answer to the question. — philipxy, Aug 27 '18 at 19:01
Thanks @chinsoon12..solution is relatively fast for a larger dataframe. Also clear answer.. helped my understanding of data.table a bit more. — thisisrg, Aug 30 '18 at 15:16

Maurits Evers · Answer 2 · 2018-08-26T23:10:34.173

1

Perhaps you're interested in an "all tidyverse" approach (self-join using combn plus summarisation)?

df %>%
    group_by(person) %>%
    summarise(tmp = list(setNames(
        as_tibble(t(combn(hp_char, 2))),
        c("HP_ID1", "HP_ID2")))) %>%
    unnest() %>%
    group_by(HP_ID1, HP_ID2) %>%
    summarise(
        counts = n(),
        involved_id = toString(person)) %>%
    filter(counts > 1)
## A tibble: 12 x 4
## Groups:   HP_ID1 [8]
#   HP_ID1 HP_ID2 counts involved_id
#   <chr>  <chr>   <int> <chr>
# 1 hp1    hp2         3 p1, p3, p7
# 2 hp1    hp3         3 p1, p3, p7
# 3 hp2    hp3         3 p1, p3, p7
# 4 hp3    hp4         2 p1, p6
# 5 hp3    hp5         2 p1, p6
# 6 hp4    hp5         2 p1, p6
# 7 hp5    hp6         2 p1, p4
# 8 hp5    hp7         2 p1, p4
# 9 hp6    hp7         2 p1, p4
#10 hp8    hp10        2 p2, p5
#11 hp8    hp9         2 p2, p5
#12 hp9    hp10        2 p2, p5

edited Aug 26 '18 at 23:10

answered Aug 26 '18 at 23:04

Maurits Evers

49,617
4
47
68

Thanks!..although with combn() the large dataset that I am working with takes a long time (worsens with > 2 co-occuring events). But this works perfect with smaller dataframes.. I hybridized this with the data.table part. If I do not get a data.table answer I will hit the checkmark :) . – thisisrg Aug 26 '18 at 23:42
@thisisrg I see; in that case I would use the optimised methods from [`arrangements`](https://cran.r-project.org/web/packages/arrangements/index.html) to get all pairwise combinations. The speed increase should be significant. If you could provide a larger representative dataset, I'd be interested in running a benchmark comparison. – Maurits Evers Aug 27 '18 at 01:42
@ Maurits Evers So the equivalent dataset to what I posted would be: 110000 rows with 4000 unique hp_chars(hp1...hp4000) and 11000 unique person (p1,..,p11000). – thisisrg Aug 27 '18 at 17:30

join in data.table keeping column ids

2 Answers2