-7

In R I have a matrix of 47 rows and 30 columns. Each cell contains a numerical value (varying from 0.0 to 1.0). Some cells have "NA" instead of a numerical value.

This is what I would like to do:

  1. For each row, sample one random value until all 47 rows were sampled once. Only numerical values can be sampled (NA's should be ignored).
  2. Take these 47 values, compute the mean, and store the mean.
  3. Repeat this process 10,000 times with replacement.
  4. Determine the 95% interval (2.5%-97.5%) of these 10,000 means.
  5. Plot a histogram of the 10,000 means showing the boundaries the 2.5% and 97.5% interval.
  6. Determine whether an observed value falls inside or outside the boundaries.
  7. Compute the P-value of the observed mean.

It's important that only one sample is drawn from each row (randomly) and that every row is sampled once in every iteration.

I hope I'm not asking too much :-) I appreciate any help!

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
pbc
  • 15
  • 3
  • The reason you are getting so many down votes is because you are basically giving us an assignment rather than asking a question. You should have an example of your data (using `dput` or code to create it as in my answer below) as well as your attempt to solve the problem and where you got stuck. Also, the fact you are asking for so many things makes it seem like you made no attempt to solve this yourself (ie it is understandable to struggle with 1, but once you have 4, 6 is really easy and you should be able to do it yourself). – Barker Nov 03 '16 at 16:36
  • Thanks, Barker. I understand your point and I confess it was sad to see so many down votes before your answer was posted. Your answer was a big relief. Next time I will be more specific. Really appreciate it! – pbc Nov 04 '16 at 04:08
  • No problem. If my answer solved your problem, you should accept it. – Barker Nov 04 '16 at 16:50

1 Answers1

0

First lets make so data to experiment with

set.seed(100)
example <- matrix(runif(47*30), nrow = 47)
example[sample(length(example), 250)] <- NA

Now we can calculate our means. The apply function samples a random value from each row (!is.na excludes NA values), mean gets the mean, and replicate repeats this 10000 times.

exmeans <- replicate(10000, mean(apply(example, 1, 
                                 function(x) sample(x[!is.na(x)], 1))))

Confidence intervals can be computed two different ways. The first uses the example means as an empirical distribution and calculates the mean from that, the second uses a normal distribution to calculate the probablilities.

confint <- quantile(exmeans, c(0.025, 0.975))
confint <- qnorm(c(0.025, 0.975), mean = mean(exmeans), sd = sd(exmeans))

Next the plots you wanted

hist(exmeans)
abline(v = confint, col = "red")

histogram of quantiles

Finally the p-value information. Once again we can eigther use an empirical distribution or a normal distribution. These provide p-values for the lower tails of the distribution, use 1 - result for upper tails.

newvalue > confint[1] & newvalue < confint[2]
ecdf(exmeans)(newvalue)
pnorm(newvalue, mean = mean(exmeans), sd = sd(exmeans))
Barker
  • 2,074
  • 2
  • 17
  • 31