I have an example data set:
gene_name | motif_id | matched_sequence |
---|---|---|
A | y1 | CCC |
A | y2 | CCAAA |
A | y3 | AAG |
A | y3 | AT |
B | y1 | AAAA |
B | y4 | AAT |
C | y5 | AAGG |
and I am trying to get a data set like this in R:
gene_name | Node1 | Node2 | sequence | occurence |
---|---|---|---|---|
A | y1 | y2 | CCC,CCAAA | 2 |
A | y1 | y3 | CCC,AAG,AT | 3 |
A | y2 | y3 | CCAAA,AAG,AT | 3 |
B | y1 | y4 | AAAA,AAT | 2 |
The motif_id column always has a value. For each gene_name, I want every combination of two different motif_ids that share that gene, with each pair listed only once, together with the list of their matched sequences and the number of those sequences.
I have tried:
library(dplyr)
data %>%
  group_by(gene_name, motif_id) %>%
  summarize(matched_sequence = paste0(matched_sequence, collapse = ",")) %>%
  mutate(count = n()) %>%
  filter(count >= 2) %>%
  summarize(
    motif_id = combn(motif_id, 2, function(x) list(setNames(x, c('Node1', 'Node2')))),
    matched_sequence = toString(matched_sequence),
    .groups = 'keep'
  ) %>%
  tidyr::unnest_wider(motif_id)
However, I failed to get the `sequence` and `occurence` columns. Can anyone give me some advice?
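
One possible way to build the table described above is sketched below in base R. It is a minimal sketch under stated assumptions, not a fix of the dplyr attempt. The assumptions are that `data` is a plain data frame rebuilt from the example table, that the count is the number of matched sequences belonging to the two motifs of a pair, and that genes with fewer than two distinct motifs are dropped. The names `pair_rows` and `result` are made up for illustration, and the column name `occurence` follows the spelling used in the desired output.

```r
# Example input, rebuilt from the table in the question (assumed to be a plain data frame).
data <- data.frame(
  gene_name        = c("A", "A", "A", "A", "B", "B", "C"),
  motif_id         = c("y1", "y2", "y3", "y3", "y1", "y4", "y5"),
  matched_sequence = c("CCC", "CCAAA", "AAG", "AT", "AAAA", "AAT", "AAGG"),
  stringsAsFactors = FALSE
)

# For each gene, enumerate every unordered pair of distinct motifs
# and collect the sequences that belong to either motif of the pair.
pair_rows <- lapply(split(data, data$gene_name), function(g) {
  motifs <- sort(unique(g$motif_id))
  if (length(motifs) < 2) return(NULL)  # a gene with a single motif has no pairs
  pairs <- combn(motifs, 2, simplify = FALSE)
  do.call(rbind, lapply(pairs, function(p) {
    seqs <- g$matched_sequence[g$motif_id %in% p]
    data.frame(
      gene_name = g$gene_name[1],
      Node1     = p[1],
      Node2     = p[2],
      sequence  = paste(seqs, collapse = ","),
      occurence = length(seqs),
      stringsAsFactors = FALSE
    )
  }))
})

# Drop genes without pairs and stack the per-gene tables into one data frame.
result <- do.call(rbind, unname(Filter(Negate(is.null), pair_rows)))
rownames(result) <- NULL
print(result)
```

Because `combn()` is applied to the sorted unique motifs of each gene, every pair appears once and a motif is never paired with itself. The sequences are filtered per pair, which is the step the attempt above is missing: there, `toString(matched_sequence)` collapses all sequences of the gene regardless of the pair.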