p-value matrix of x and y variables from anova output

Question

I have many X and Y variables (something like, 500 x 500). The following just small data:

yvars <- data.frame (Yv1 = rnorm(100, 5, 3), Y2 = rnorm (100, 6, 4),
  Yv3 = rnorm (100, 14, 3))
xvars <- data.frame (Xv1 = sample (c(1,0, -1), 100, replace = T),
 X2 = sample (c(1,0, -1), 100, replace = T), 
 Xv3 = sample (c(1,0, -1), 100, replace = T), 
 D = sample (c(1,0, -1), 100, replace = T))

I want to extact p-values and make a matrix like this:

     Yv1    Y2     Yv3   
Xv1
X2
Xv3
D

Here is my attempt to loop the process:

prob = NULL
   anova.pmat <- function (x) {
            mydata <- data.frame(yvar = yvars[, x], xvars)
            for (i in seq(length(xvars))) {
              prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
              data = mydata))$`Pr(>F)`[1]
              }
              }
    sapply (yvars,anova.pmat)
    Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution ?

Edit:

For the first Y variable:

For first Y variable:

prob <- NULL
 mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
              prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
              data = mydata))$`Pr(>F)`[1]
              }

prob 
[1] 0.4995179 0.4067040 0.4181571 0.6291167

Edit again:

for (j in seq(length (yvars))){
        prob <- NULL
        mydata <- data.frame(yvar = yvars[, j], xvars)
         for (i in seq(length(xvars))) {
                  prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                  data = mydata))$`Pr(>F)`[1]
                  }
}

Gives the same result as above !!!

The usual place to start is to get your function to work with one instance of the loop variable(s). Then you can deterrmine whether the loop body makes sense. At the moment I cannot figure out what you attempting, but you shoud try calling anova.pmat with one "x" at a time. (This effort may be revealing in that it should give you a better sense of hte false discovery rate with multiple testing.) — IRTFM, Jul 10 '12 at 18:52
I'm having trouble understanding what you're trying to do with `mydata <- data.frame(yvar = yvars[, x], xvars)`. — mac, Jul 10 '12 at 18:58
@mac this will choose one yvar and other xvariables (see the edits), so that then different y varibles can be passed at different times ....I beleif — shNIL, Jul 10 '12 at 19:04
@nilS. I see you edits, but you insted replace the function with a for-loop. That's not going to help you get that function working. You need to call the function with a single value of "x". Then you will discover that setting "prob" to NULL does nothing inside the function. I would think initializing to the proper lenght list or vector would be more helpful. — IRTFM, Jul 10 '12 at 19:19
i'm pretty sure that your original call `sapply(yvars,anova.pmat)` does not give the behavior you describe "for the first Y variable". — mac, Jul 10 '12 at 19:24

score 4 · Accepted Answer · answered Jul 10 '12 at 20:18

4

Here is an approach that uses plyr to loop over the columns of a dataframe (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value, arranging it into a matrix. Adding the row/column names is just extra.

library("plyr")

probs <- laply(xvars, function(x) {
    laply(yvars, function(y) {
        anova(lm(y~x))$`Pr(>F)`[1]
    })
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)

answered Jul 10 '12 at 20:18

Brian Diggs

57,757
13
166
188

1

(+1) So much cleaner solution than mine! – chl Jul 10 '12 at 20:23

score 1 · Answer 2 · answered Jul 10 '12 at 20:10

Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:

dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
                   x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1, 
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, nc=3, dimnames=list(levels(dfrm$x), levels(dfrm$y)))

Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlation for each pairwise comparisons, and transform the r statistic into a Fisher-Snedecor F using the relation F = ν_ar²/(1-r²), where degrees of freedom are defined as ν_a=(n-2)-#{(x_i=NA),(y_i=NA)} (that is, (n-2) minus the number of pairwise missing values--if there're no missing values, this formula is the usual coefficient R² in regression).

p-value matrix of x and y variables from anova output

2 Answers2