4

I have many X and Y variables (something like, 500 x 500). The following just small data:

yvars <- data.frame (Yv1 = rnorm(100, 5, 3), Y2 = rnorm (100, 6, 4),
  Yv3 = rnorm (100, 14, 3))
xvars <- data.frame (Xv1 = sample (c(1,0, -1), 100, replace = T),
 X2 = sample (c(1,0, -1), 100, replace = T), 
 Xv3 = sample (c(1,0, -1), 100, replace = T), 
 D = sample (c(1,0, -1), 100, replace = T))

I want to extact p-values and make a matrix like this:

     Yv1    Y2     Yv3   
Xv1
X2
Xv3
D

Here is my attempt to loop the process:

prob = NULL
   anova.pmat <- function (x) {
            mydata <- data.frame(yvar = yvars[, x], xvars)
            for (i in seq(length(xvars))) {
              prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
              data = mydata))$`Pr(>F)`[1]
              }
              }
    sapply (yvars,anova.pmat)
    Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution ?

Edit:

For the first Y variable:

For first Y variable:

prob <- NULL
 mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
              prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
              data = mydata))$`Pr(>F)`[1]
              }

prob 
[1] 0.4995179 0.4067040 0.4181571 0.6291167

Edit again:

for (j in seq(length (yvars))){
        prob <- NULL
        mydata <- data.frame(yvar = yvars[, j], xvars)
         for (i in seq(length(xvars))) {
                  prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
                  data = mydata))$`Pr(>F)`[1]
                  }
}

Gives the same result as above !!!
shNIL
  • 963
  • 10
  • 24
  • 1
    The usual place to start is to get your function to work with one instance of the loop variable(s). Then you can deterrmine whether the loop body makes sense. At the moment I cannot figure out what you attempting, but you shoud try calling anova.pmat with one "x" at a time. (This effort may be revealing in that it should give you a better sense of hte false discovery rate with multiple testing.) – IRTFM Jul 10 '12 at 18:52
  • 1
    I'm having trouble understanding what you're trying to do with `mydata <- data.frame(yvar = yvars[, x], xvars)`. – mac Jul 10 '12 at 18:58
  • @DWin please see my recent edits – shNIL Jul 10 '12 at 19:02
  • @mac this will choose one yvar and other xvariables (see the edits), so that then different y varibles can be passed at different times ....I beleif – shNIL Jul 10 '12 at 19:04
  • 1
    @nilS. I see you edits, but you insted replace the function with a for-loop. That's not going to help you get that function working. You need to call the function with a single value of "x". Then you will discover that setting "prob" to NULL does nothing inside the function. I would think initializing to the proper lenght list or vector would be more helpful. – IRTFM Jul 10 '12 at 19:19
  • i'm pretty sure that your original call `sapply(yvars,anova.pmat)` does not give the behavior you describe "for the first Y variable". – mac Jul 10 '12 at 19:24

2 Answers2

4

Here is an approach that uses plyr to loop over the columns of a dataframe (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value, arranging it into a matrix. Adding the row/column names is just extra.

library("plyr")

probs <- laply(xvars, function(x) {
    laply(yvars, function(y) {
        anova(lm(y~x))$`Pr(>F)`[1]
    })
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)
Brian Diggs
  • 57,757
  • 13
  • 166
  • 188
1

Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:

dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
                   x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1, 
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, nc=3, dimnames=list(levels(dfrm$x), levels(dfrm$y)))

Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlation for each pairwise comparisons, and transform the r statistic into a Fisher-Snedecor F using the relation F = νar2/(1-r2), where degrees of freedom are defined as νa=(n-2)-#{(xi=NA),(yi=NA)} (that is, (n-2) minus the number of pairwise missing values--if there're no missing values, this formula is the usual coefficient R2 in regression).

chl
  • 27,771
  • 5
  • 51
  • 71