I want to keep the column (="person" in the dataframe) after performing a join in a data.table. I was able to get something close to the desired output but it required switching between data.table and dplyr, because of my limited experience with data.table :
Here the dataframe :
df<-structure(list(person = c("p1", "p1", "p1", "p1", "p1", "p1",
"p1", "p2", "p2", "p2", "p3", "p3", "p3", "p4", "p4", "p4", "p5",
"p5", "p5", "p6", "p6", "p6", "p7", "p7", "p7"), hp_char = c("hp1",
"hp2", "hp3", "hp4", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10",
"hp1", "hp2", "hp3", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10",
"hp3", "hp4", "hp5", "hp1", "hp2", "hp3")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("person",
"hp_char"), spec = structure(list(cols = structure(list(person = structure(list(), class = c("collector_character",
"collector")), hp_char = structure(list(), class = c("collector_character",
"collector"))), .Names = c("person", "hp_char")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
I am doing a self-join to get the number of instances of co-occurence of any two "hp_id" as follows (similar to what is elaborated in this question). I am keeping the "person" in by=.(...)
to see who were involved the cooccurence combinations (e.g. hp1 and hp2 co-ocurred in individuals p1,p3 and p7) :
df_by2<- setDT(df)[df, on = "person", allow = TRUE][
hp_char < i.hp_char, .N, by = .(person ,HP_ID1 = hp_char, HP_ID2 = i.hp_char)]
However because of including "person" in by =.(person,...
, the count (= N) is separated according to combination of "person","hp_id" and "hp_id2". So I switched to dplyr to get close to what I want as follows.
dfx<- df_by2 %>% group_by(HP_ID1,HP_ID2) %>% mutate (counts=length(person)) %>% spread(person,person) %>% select (-N) %>% unique() %>% filter(counts>1) %>% unite(person,p1:p7, sep="") %>% mutate (involved_id=gsub('?NA', ' ', person)) %>% select (-person)
This is the output I get:
# A tibble: 12 x 4
HP_ID1 HP_ID2 counts involved_id
<chr> <chr> <int> <chr>
1 hp1 hp2 3 p1 p3 p7
2 hp1 hp3 3 p1 p3 p7
3 hp10 hp8 2 p2 p5
4 hp10 hp9 2 p2 p5
5 hp2 hp3 3 p1 p3 p7
6 hp3 hp4 2 p1 p6
7 hp3 hp5 2 p1 p6
8 hp4 hp5 2 p1 p6
9 hp5 hp6 2 p1 p4
10 hp5 hp7 2 p1 p4
11 hp6 hp7 2 p1 p4
12 hp8 hp9 2 p2 p5
This is close but the desired output (with properly formatted albeit untidy "involved_id" column) is:
# A tibble: 12 x 4
HP_ID1 HP_ID2 counts involved_id
<chr> <chr> <int> <chr>
1 hp1 hp2 3 p1, p3, p7
2 hp1 hp3 3 p1, p3, p7
3 hp10 hp8 2 p2, p5
4 hp10 hp9 2 p2, p5
5 hp2 hp3 3 p1, p3, p7
6 hp3 hp4 2 p1, p6
7 hp3 hp5 2 p1, p6
8 hp4 hp5 2 p1, p6
9 hp5 hp6 2 p1, p4
10 hp5 hp7 2 p1, p4
11 hp6 hp7 2 p1, p4
12 hp8 hp9 2 p2, p5
All of this is very cumbersome and I was wondering if there is a simpler approach to this. I have just recently came across data.table and enjoying learning it. Any help using data.table is highly appreciated.