1

I have some code that I have inherited that generates a matrix of significance levels for pairwise comparisons from predicted means. Since the model includes data from multiple sites and treatments, but I only want to compare between genotypes within a treatment within a site, only a subset of the comparisons are meaningful.

Here's a dummy version of what is currently generated.

effect.nam <- expand.grid(site=c("A","B","C"), treat=c("low","high"), genotype=c("A1","B2"))
labels <-  paste(effect.nam[,1],effect.nam[,2],effect.nam[,3], sep=".")
mat <-matrix(sample(c(T,F),144,replace=T),12,12)
dimnames(mat) <- list(labels,labels)

Obviously in this situation the T/F are random. What I would like is to only see the the comparisons within a sites and treatment. It would be nice to also remove the self comparison. Ideally I want to return a dataframe in the form:

   Site Treat Genotype1 Genotype2   Sig
1     A   low         A1         2  TRUE
2     A   low         A1         3  TRUE
3     A   low         B2         3  TRUE
4     A  high         A1         2  TRUE
5     A  high         A1         3 FALSE
6     A  high         B2         3 FALSE
7     B   low         A1         2 FALSE
8     B   low         A1         3  TRUE
9     B   low         B2         3 FALSE
10    B  high         A1         2  TRUE
11    B  high         A1         3  TRUE
12    B  high         B2         3  TRUE
13    C   low         A1         2  TRUE
14    C   low         A1         3  TRUE
15    C   low         B2         3 FALSE
16    C  high         A1         2  TRUE
17    C  high         B1         3  TRUE
18    C  high         A2         3  TRUE

I've made a few false starts, and if anyone had some quick pointers in the right direction it would be appreciated.

In the very useful answer Chase gave below, you can see that while the meaningless comparisons have been removed, each useful comparison is contained twice (genotype 1 vs genotype 2 and vice versa). I can't see how to easily remove these, since they're not really duplicates...

--Update--

My apologies, I needed to alter mat so that when Chase's solution is implemented, Genotype1 and Genotype2 are factor, not int, as is the case in my real situation. I've made a couple of additions to the solution below, given here (adding a sorting column to avoid doubling up comparisons).

It works, which is great, but adding those columns seems awkward to me - is there a more elegant way?

mat.m <- melt(mat)
mat.m[,c("site1", "treat1", "genotype1")] <-  colsplit(mat.m$X1, "\\.", c("site1", "treat1", "genotype1"))
mat.m[,c("site2", "treat2", "genotype2")] <-  colsplit(mat.m$X2, "\\.", c("site2", "treat2", "genotype2"))
str(mat.m)
mat.m$genotype1sort <- mat.m$genotype1
mat.m$genotype2sort <- mat.m$genotype2
levels(mat.m$genotype1sort) <- c(1, 2)
levels(mat.m$genotype2sort) <- c(1, 2)
mat.m$genotype1sort <- as.numeric(levels(mat.m$genotype1sort))[mat.m$genotype1sort]
mat.m$genotype2sort <- as.numeric(levels(mat.m$genotype2sort))[mat.m$genotype2sort]

subset(mat.m, site1 == site2 & treat1 == treat2 & genotype1 != genotype2 & genotype1sort < genotype2sort,
   select = c("site1", "treat1", "genotype1", "genotype2", "value"))

#-----
    site1 treat1 genotype1 genotype2 value
73      A    low        A1        B2  TRUE
86      B    low        A1        B2  TRUE
99      C    low        A1        B2  TRUE
112     A   high        A1        B2  TRUE
125     B   high        A1        B2 FALSE
138     C   high        A1        B2 FALSE
alexwhan
  • 15,636
  • 5
  • 52
  • 66
  • If performance becomes an issue, I'd look into only generating the comparisons you need instead of doing lots of comparisons that aren't useful and whittling that list down post facto. – Chase May 08 '12 at 04:44

1 Answers1

1

I think this gets what you want using a few functions from reshape2. First, melt the data into long format:

require(reshape2)
mat.m <- melt(mat)
#--------
        X1      X2 value
1  A.low.1 A.low.1  TRUE
2  B.low.1 A.low.1  TRUE

Next, split the columns X1 and X2 into three columns on the "."

mat.m[,c("site1", "treat1", "genotype1")] <-  colsplit(mat.m$X1, "\\.", c("site1", "treat1", "genotype1"))
mat.m[,c("site2", "treat2", "genotype2")] <-  colsplit(mat.m$X2, "\\.", c("site2", "treat2", "genotype2"))

Now, mat.m looks like:

> head(mat.m,3)
        X1      X2 value site1 treat1 genotype1 site2 treat2 genotype2
1  A.low.1 A.low.1  TRUE     A    low         1     A    low         1
2  B.low.1 A.low.1  TRUE     B    low         1     A    low         1
3  C.low.1 A.low.1 FALSE     C    low         1     A    low         1         

Finally, subset the rows you want to grab and get the columns in the order you want. Note, there are lots of ways to do this beyond subset, but I use it for clarity here:

subset(mat.m, site1 == site2 & treat1 == treat2 & genotype1 != genotype2,
       select = c("site1", "treat1", "genotype1", "genotype2", "value"))
#--------
    site1 treat1 genotype1 genotype2 value
7       A    low         2         1  TRUE
20      B    low         2         1  TRUE

You could probably do something more clever and avoid splitting the two columns out, but this seems to do what you want and should be reasonably straight forward.

Chase
  • 67,710
  • 18
  • 144
  • 161
  • That's fantastic, it takes me nearly all the way there. I've just applied it and it works well. I think I was about half the way there, but you've made my life a lot easier. If I could be greedy, I would like to exclude the double up comparisons (ie 1 vs 2 and 2 vs 1). Any quick ideas? – alexwhan May 08 '12 at 05:17
  • @alexwhan - awesome, glad this is helpful! Think about the relationship between `genotype1` and `genotype2`, currently we have it defined as inequality, or simply that `genotype1 != genotype2`. How would you modify that relationship to ensure that you only get one of the two comparisons, i.e. `1 vs 2` OR `2 vs 1`? – Chase May 08 '12 at 13:10
  • I'm sorry, I've racked my brain and failed. The only obvious thing seems to be >, but this doesn't make sense for factors. I could convert to numbers, but that seems cumbersome... – alexwhan May 09 '12 at 01:14
  • @alexwhan - we were thinking along the same lines...either `<` or `>`. What is the structure of `mat.m`, i.e. `str(mat.m)`? I show that `genotype1` and `genotype2` are both `int`, so the boolean comparison would work fine. – Chase May 09 '12 at 02:42