The goal is to compare each ID with every other ID by computing distances. Some IDs are related because they belong to the same group, and related IDs do not need to be compared with each other.

Consider the following data frame, df:

ID AN     AW      Group
a  white  green   1
b  black  yellow  1
c  purple gray    2
d  white  gray    2

The following code generates every unique pair of IDs (from the question "R Generate non repeating pairs in dataframe"):

ids <- combn(unique(df$ID), 2)
data.frame(df[match(ids[1,], df$ID), ], df[match(ids[2,], df$ID), ])

#ID   AN     AW    ID2   AN2    AW2
a   white  green   b   black  yellow
a   white  green   c   purple gray
a   white  green   d   white  gray
b   black  yellow  c   purple gray 
b   black  yellow  d   white  gray
c   purple gray    d   white  gray

I want to know whether it is possible to skip certain combinations so that I get this result instead:

#ID   AN     AW    Group   ID2   AN2    AW2   Group2
a   white  green     1      c   purple gray    2
a   white  green     1      d   white  gray    2
b   black  yellow    1      c   purple gray    2
b   black  yellow    1      d   white  gray    2

That is, I want to avoid computing these pairs:

#ID   AN     AW    Group   ID2   AN2    AW2    Group2
a   white  green     1      b   black  yellow    1
c   purple gray      2      d   white  gray      2

I can subset the result by comparing the groups afterwards, but that costs extra computing time because the data frame is big and the number of combinations grows as n*(n-1)/2.

Is this possible? Or do I have to generate all combinations and then filter out the comparisons within the same group?
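
For reference, the "generate everything, then filter" baseline described above could look roughly like the following. This is only a sketch on the four-row example; the names g1, g2, keep, allPairs, crossPairs and crossIds are introduced here for illustration and are not part of the original code.

```r
# Baseline sketch: build every ID pair, then drop the within-group pairs.
df <- data.frame(ID = letters[1:4],
                 AN = c("white", "black", "purple", "white"),
                 AW = c("green", "yellow", "gray", "gray"),
                 Group = rep(c(1, 2), each = 2),
                 stringsAsFactors = FALSE)

ids <- combn(df$ID, 2)                    # 2-row matrix, one column per ID pair
g1  <- df$Group[match(ids[1, ], df$ID)]   # group of the first ID in each pair
g2  <- df$Group[match(ids[2, ], df$ID)]   # group of the second ID in each pair
keep <- g1 != g2                          # TRUE only for cross-group pairs

allPairs   <- ncol(ids)                   # n * (n - 1) / 2 pairs in total
crossPairs <- sum(keep)                   # pairs that actually need comparing
crossIds   <- ids[, keep, drop = FALSE]   # the cross-group ID pairs
```

With four IDs in two groups of two, allPairs is 6 and crossPairs is 4, so two of the six pairs are built only to be discarded. That wasted share is what the question is trying to avoid on a large data frame.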

Saul Garcia

2 Answers

Here is a fairly lengthy base R solution that assumes that there may be more than two groups.

# create test data.frame
df <- data.frame(ID=letters[1:4], AN=c("white", "black", "purple", "white"),
                 AW=c("green", "yellow", "gray", "gray"),
                 Group=rep(c(1,2),each=2), stringsAsFactors=FALSE)

# split data.frame by group, subset df to needed variables
dfList <- split(df[, c("ID", "Group")], df$Group)
# use combn to get all group-pair combinations
groupPairs <- combn(unique(df$Group), 2)

Next, we loop through (via lapply) all pairwise combinations of groups. For each combination, we build a data.frame holding every pairing of IDs between the two groups via expand.grid. The IDs are extracted (with the [[ ]] operator) from the named list dfList, using the group names from groupPairs[1,i] and groupPairs[2,i] converted to character so that the lookup is by name rather than by position. The per-pair data.frames are then stacked with rbind, which also works when there are more than two groups or when the groups have different sizes.

# get a list of all ID combinations by group combination
myComparisonList <- lapply(seq_len(ncol(groupPairs)), function(i) {
                           expand.grid(dfList[[as.character(groupPairs[1,i])]]$ID,
                                       dfList[[as.character(groupPairs[2,i])]]$ID,
                                       stringsAsFactors=FALSE)
                           })
# stack the per-pair data.frames into one two-column data.frame of IDs
idsMat <- do.call(rbind, myComparisonList)

# bind comparison pairs together by column
dfDone <- cbind(df[match(idsMat[,1], df$ID), ], df[match(idsMat[,2], df$ID), ])
# differentiate names
names(dfDone) <- paste0(names(dfDone), rep(c(".1", ".2"),
                        each=length(names(df))))
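
The steps above can be collected into a single function. The following is a minimal sketch, not a tuned implementation: crossGroupPairs is a hypothetical helper name, it assumes the columns are called ID and Group as in the example, and the three-group data frame df3 is made up purely to show unequal group sizes.

```r
# Hypothetical helper: all ID pairs whose two members come from different groups.
crossGroupPairs <- function(df) {
  idsByGroup <- split(df$ID, df$Group)     # named list: group name -> IDs
  groups <- names(idsByGroup)
  if (length(groups) < 2) {                # nothing to compare across groups
    return(data.frame(Var1 = character(0), Var2 = character(0),
                      stringsAsFactors = FALSE))
  }
  groupPairs <- combn(groups, 2)           # each column is one pair of group names
  pairList <- lapply(seq_len(ncol(groupPairs)), function(i) {
    expand.grid(idsByGroup[[groupPairs[1, i]]],
                idsByGroup[[groupPairs[2, i]]],
                stringsAsFactors = FALSE)
  })
  do.call(rbind, pairList)                 # stack the per-group-pair results
}

# Made-up example with three groups of sizes 2, 1 and 2
df3 <- data.frame(ID = letters[1:5], Group = c(1, 1, 2, 3, 3),
                  stringsAsFactors = FALSE)
pairs3 <- crossGroupPairs(df3)
groupOf <- setNames(df3$Group, df3$ID)     # lookup: ID -> group
```

Here pairs3 has 8 rows, compared with the 10 pairs that combn would produce on all five IDs, and no row pairs two IDs from the same group. Only cross-group pairs are ever built, so the saving grows as the groups get larger.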
lmo
  • Indeed I have more than two groups. I am trying to understand the code, but if I run it from `myComparisonList` it throws this error: `Error: unexpected ')' in: " dfList[[groupPairs[2,i]]]$ID, stringsAsFactors=F))"` – Saul Garcia May 08 '16 at 15:02
  • This works!!!! I did not quite understand the part about myComparisonList. Could you clarify this? But really, this answer helped me a lot! – Saul Garcia May 08 '16 at 15:31
  • @SaulGarcia Hopefully the additional info in my answer is helpful. – lmo May 08 '16 at 15:41
  • I appreciate it! Thank you – Saul Garcia May 08 '16 at 20:25

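If you can use SQL for this (for example through the sqldf package), a self-join on differing groups gives the cross-group pairs, where f is the data frame and g is the group column. Joining on t1.g < t2.g rather than t1.g != t2.g returns each pair once instead of in both orders.

sqldf("select * from f t1 inner join f t2 on t1.g < t2.g")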
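
For readers without sqldf installed, a similar self-join can be sketched in base R with merge. This is an illustration under assumptions: the data frame f and its column g are made up to mirror the names in the query, and the join keeps each pair once by using <. Note that merge with by = NULL materializes the full Cartesian product first, so this shows the logic of the join rather than a way to save computation.

```r
# Made-up data frame mirroring the names used in the SQL query
f <- data.frame(ID = letters[1:4], g = rep(c(1, 2), each = 2),
                stringsAsFactors = FALSE)

# by = NULL gives the Cartesian product; shared column names get .x / .y suffixes
crossJoin <- merge(f, f, by = NULL)

# keep rows whose two sides are in different groups, each unordered pair once
res <- crossJoin[crossJoin$g.x < crossJoin$g.y, ]
```

On this example crossJoin has 16 rows and res keeps the 4 cross-group pairs (a with c, a with d, b with c, b with d).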
Chirayu Chamoli