Count number of shared observations between samples using dplyr

Question

I have a list of observations grouped by sample. I want to find the samples that share the most identical observations. Two observations are identical when both the start and end values match between two samples.

I'd like to do this in R, preferably with dplyr. I've been getting used to dplyr for simpler data handling, but this task is beyond what I can currently do. I think the solution would involve grouping start and end into a single variable with group_by(start, end), but I also need to keep track of which sample each observation belongs to so that I can compare between samples.

example:

sample  start   end
a   2   4
a   3   6
a   4   8
b   2   4
b   3   6
b   10  12
c   10  12
c   0   4
c   2   4

Here:

samples a, b and c share 1 observation (2 4)
samples a and b share 2 observations (2 4, 3 6)
samples b and c share 2 observations (2 4, 10 12)
samples a and c share 1 observation (2 4)

I'd like an output like:

abc 1
ab 2
bc 2
ac 1

and also to see what the shared observations are if possible:

abc 2 4
ab 2 4 
ab 3 6

etc

Thanks in advance

score 1 · Accepted Answer · answered Jun 07 '17 at 13:37

Here's something that should get you going:

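library(dplyr)

# For each distinct (start, end) pair, list the samples that contain it
# and count how many samples that is.
df %>% 
  group_by(start, end) %>% 
  summarise(
    samples = paste(unique(sample), collapse = ""), 
    n = n_distinct(sample))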

# Source: local data frame [5 x 4]
# Groups: start [?]
# 
#   start   end samples     n
#   <int> <int>   <chr> <int>
# 1     0     4       c     1
# 2     2     4     abc     3
# 3     3     6      ab     2
# 4     4     8       a     1
# 5    10    12      bc     2
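
The summary above gives one row per observation. To get the pairwise counts the question asks for (ab 2, bc 2, ac 1), one option is a self-join on start and end. This is only a sketch under a few assumptions: the names obs, shared_obs and pair_counts are made up for illustration, the example data is rebuilt inline so the block runs on its own, and sample is stored as character so that `<` can be used to keep each pair once.

```r
library(dplyr)

# Example data from the question, rebuilt here so the sketch is self-contained.
df <- data.frame(
  sample = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  start  = c(2L, 3L, 4L, 2L, 3L, 10L, 10L, 0L, 2L),
  end    = c(4L, 6L, 8L, 4L, 6L, 12L, 12L, 4L, 4L),
  stringsAsFactors = FALSE
)

# One row per distinct (sample, start, end), so duplicates are not counted twice.
obs <- distinct(df, sample, start, end)

# Join the table to itself on (start, end); every match is an observation
# found in two samples. Keeping sample_1 < sample_2 drops self-matches and
# mirrored pairs. Note: dplyr >= 1.1.0 may warn about a many-to-many join here.
shared_obs <- obs %>%
  inner_join(obs, by = c("start", "end"), suffix = c("_1", "_2")) %>%
  filter(sample_1 < sample_2)

# Number of shared observations for each pair of samples.
pair_counts <- shared_obs %>%
  count(sample_1, sample_2, name = "shared")

pair_counts
```

For the example data this should give 2 for a/b, 1 for a/c and 2 for b/c. shared_obs keeps the start and end columns, so it also shows which observations each pair has in common.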

score 1 · Answer 2 · answered Jun 07 '17 at 13:59

Here is an idea using base R:

# Split by every (start, end) combination and drop the empty groups.
groups <- Filter(nrow, split(df, list(df$start, df$end)))

final_d <- data.frame(count1 = sapply(groups, nrow), 
                      pairs1 = sapply(groups, function(i) paste(i$sample, collapse = '')))

#      count1 pairs1
#0.4        1      c
#2.4        3    abc
#3.6        2     ab
#4.8        1      a
#10.12      2     bc
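
If pairwise counts are what is needed, a base R alternative is to build a 0/1 incidence matrix (observations by samples) and take its cross product. This is a rough sketch rather than a tuned solution; the names u, m and shared are illustrative, and the example data is rebuilt inline so the block is self-contained.

```r
# Example data from the question.
df <- data.frame(
  sample = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  start  = c(2L, 3L, 4L, 2L, 3L, 10L, 10L, 0L, 2L),
  end    = c(4L, 6L, 8L, 4L, 6L, 12L, 12L, 4L, 4L),
  stringsAsFactors = FALSE
)

# Drop duplicate rows so each cell of the incidence matrix is 0 or 1.
u <- unique(df[c("sample", "start", "end")])

# Rows are "start end" keys, columns are samples.
m <- unclass(table(paste(u$start, u$end), u$sample))

# t(m) %*% m: entry [i, j] is the number of observations samples i and j share;
# the diagonal holds the number of distinct observations per sample.
shared <- crossprod(m)

shared
```

The off-diagonal entries of shared are the pairwise counts, for example shared["a", "b"] for samples a and b.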
