I have a dataframe with binary values like so:

df <- data.frame(a = rep(c(1, 0), 9), b = rep(c(0, 1, 0), 6), c = rep(c(0, 1), 9))

The purpose is to first obtain all pairwise combinations of columns:

combos <- function(df, n) {
  # For each size in n, build every combination of that many columns
  unlist(lapply(n, function(x) combn(df, x, simplify = FALSE)), recursive = FALSE)
}

j <- combos(df, 2)

Next, for each data frame in list j, I want the proportion of rows in which the two columns agree, i.e. are either (0,0) or (1,1). I can get the proportions like so:

# Row sums: 1 means the two columns differ, 0 or 2 means they agree
k <- lapply(j, function(x) data.frame(new = rowSums(x[, 1:2])))
lapply(k, function(x) data.frame(prop1 = length(which(x == 1)) / 18,
                                 prop2 = length(which(x == 0 | x == 2)) / 18))

However, this seems slow and complicated for larger lists. A couple of questions:

1) Is there a faster/better method than this? My actual list is 20 dataframes, each with dim 250 x 400. I tried dist(df, method = "binary"), but it looks like the binary method does not take (0,0) instances into account.

2) Also, why do I not get 18 when I try to divide using length(x[1]) or lengths(x[1])? In the example, I divided by hard-coding the length of the vector new.

Any help is very much appreciated!

thisisrg

1 Answer

#Get the combinations
j = combn(x = df, m = 2, simplify = FALSE)

#Get the Proportions
sapply(j, function(x) length(which(x[1] == x[2]))/NROW(x))
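On the second question, a short sketch of why NROW(x) is used as the denominator here may help. It assumes the example df from the question; the name x is just an illustrative two-column subset. Single-bracket indexing on a data frame returns another data frame, and length() of a data frame counts its columns, not its rows.

```r
# Example data from the question
df <- data.frame(a = rep(c(1, 0), 9), b = rep(c(0, 1, 0), 6), c = rep(c(0, 1), 9))

# Illustrative two-column subset, like one element of j
x <- df[, c("a", "b")]

length(x[1])    # x[1] is a one-column data frame, so this counts columns
length(x[[1]])  # x[[1]] is the underlying vector, so this counts elements
nrow(x)         # number of rows of the data frame
NROW(x)         # like nrow(), but also works on plain vectors
```

So dividing by NROW(x) (or nrow(x)) avoids hard-coding the number of rows.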

As @thelatemail commented, if you do not need to store the intermediate combinations, you can do it all in one step using

combn(x = df, m = 2, FUN=function(x) length(which(x[1] == x[2]))/NROW(x))
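For wider data (such as 400 columns), a matrix-based alternative might be faster than looping over pairs. This is only a sketch and is not benchmarked here; it assumes every column is strictly 0/1 with no missing values. The names M, agree, idx and props are illustrative. The idea is that crossprod(M) counts the (1,1) rows for every pair of columns and crossprod(1 - M) counts the (0,0) rows.

```r
# Example data from the question
df <- data.frame(a = rep(c(1, 0), 9), b = rep(c(0, 1, 0), 6), c = rep(c(0, 1), 9))

# Assumes all columns are 0/1 with no NA values
M <- as.matrix(df)

# (1,1) counts plus (0,0) counts for every column pair, as a proportion of rows
agree <- (crossprod(M) + crossprod(1 - M)) / nrow(M)

# Pull out the pairs in the same order that combn() generates them
idx <- t(combn(ncol(M), 2))
props <- agree[idx]
props
```

The full matrix agree also holds each pair twice plus a diagonal of 1s, so indexing with idx keeps just one value per pair.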
d.b