Bootstrap variables correlation in R

Question

My intention was to write several functions aimed at finding the overall similarity between two covariance matrices, either by multiplying them with random vectors and correlating the response vectors or by bootstrapping one of the matrices to obtain the correlation coefficient distribution that can serve for comparison. But in both cases I got erroneous results. The observed between-matrix correlation was high up to 0.93, but the distribution only ranged up to 0.2 the most. This is the function`s code:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR)
}

In fact it is hard for me to belive that correlation coefficient between two original matrices is high, and the one between the first original matrix and the second re-sampled one is maximal 0.2 after 10000 bootstrap repetitions.

Any comments on the validity of the code?

The downvote wasn't necessary, but the question does not fulfill the standards set in the [FAQ](http://stackoverflow.com/faq). You might try at www.crossvalidated.com for statistical advice though.Take a look at the question here : http://stats.stackexchange.com/questions/14673/measure-of-similarity-or-distance-between-two-symmetric-covariance-matrices — Joris Meys, Oct 13 '11 at 15:25

score 2 · Accepted Answer · answered Oct 23 '11 at 15:19

Sorry, I am not enough educated to catch up your goals about checking the correlation efficient between two covariance matrices, but I tried to apprehend your code per se.

If I am right, you are making up 10.000 different matrices from the same matrix (mat2) by reordering the cells all round, and recomputing the correlation between the covariance matrix of mat1 and the covariance matrix of the resampled array. Those are stored in the statSim variable.

You said the original correaltion efficient was high (statObs), but the maximum of statSim was low, which is strange. I think the problem is with your result list:

list( obs = statObs , sumFit = sum(statSim > statObs)/numR)

Where you return the original correaltion coefficient (obs), but not the written maximum with sumFit. There you might use eg. max(statSim). I see the point in returning sumFit for checking if the resampling did any improvement to the correlation efficient, but based on your code, I see no problem about the theory.

Updated function with max of simulated correlation coefficients:

resamplerSimAlt <- function(mat1, mat2, numR, graph = FALSE)
{
  statSim <- numeric(numR)
  mat1vcv <- cov(mat1)
  mat2vcvT <- cov(mat2)
  ltM1 <- mat1vcv[col(mat1vcv) <= row(mat1vcv)]
  ltM2T <- mat2vcvT[col(mat2vcvT) <= row(mat2vcvT)]
  statObs <- cor(ltM1, ltM2T)                           
  indice <- c(1:length(mat2))
  resamplesIndices <- lapply(1:numR, function(i) sample(indice, replace = F))
  for (i in 1:numR)
  {
    ss <- mat2[sample(resamplesIndices[[i]])]
    ss <- matrix(ss, nrow = dim(mat2)[[1]], ncol = dim(mat2)[[2]])
    mat2ss <- cov(ss)
    ltM2ss <- mat2ss[col(mat2ss) <= row(mat2ss)]
    statSim[i] <- cor(ltM1, ltM2ss)
  }
  if (graph == TRUE)
  {
    plot(1, main = "resampled data density distribution", xlim = c(0, statObs+0.1), ylim = c(0,14))
    points(density(statSim), type="l", lwd=2)
    abline(v = statObs)
    text(10, 10, "observed corelation = ")
  }
  list( obs = statObs , sumFit = sum(statSim > statObs)/numR, max=max(statSim))
}

What I had run:

> mat1 <- matrix(runif(25),5,5)
> mat2 <- mat1+0.2
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 1

$sumFit
[1] 0

$max
[1] 0.94463

And with random mat2:

> mat2 <- matrix(runif(25),5,5)
> resamplerSimAlt(mat1, mat2, 10000)
$obs
[1] 0.31144

$sumFit
[1] 0.9124

$max
[1] 0.9231

My answer might not be a real answer. If that would be the case, please give more details about the problem :)

In fact the sumFit part of the function was meant to provide some sort of P value but I overlooked the sum, so it just should go like this `sumFit = length(statSim > statObs)/numR`. Thanks for the support. — Fedja Blagojevic, Oct 30 '11 at 18:38

Bootstrap variables correlation in R

1 Answers1