randomly select values from each row across columns in a data.frame and average them in R

Question

This question is similar to a previous one I made here: randomly sum values from rows and assign them to 2 columns in R

Since I'm having difficulties with R, this question is both about programming and statistics. I'm very new to both.

I have a data.frame with 219 subjects in one column. There are 7 other columns, and in each row they hold a number representing the difference in response time for that particular subject when exposed to the two conditions of the experiment.

This is how the data looks (I'm using the head function, otherwise it would be too long):

    > head(RTsdiff)
      subject   block3diff   block4diff   block5diff   block6diff   block7diff
    1   40002  0.076961798  0.046067460 -0.027012048  0.017920261  0.002660317
    2   40004  0.037558511 -0.016535211 -0.044306743 -0.011541667  0.044422892
    3   40006 -0.017063123 -0.031156150 -0.084003876 -0.070227149 -0.113382784
    4   40008 -0.015204017 -0.009954545 -0.004082353  0.006327839  0.022335271
    5   40009  0.006055829 -0.045376437 -0.002725572  0.016443182  0.032848128
    6   40010 -0.003017857 -0.034398268 -0.034476491  0.014158824 -0.036592982
       block8diff    block9dif
    1  0.03652273  0.037306173
    2 -0.08032784 -0.150682051
    3 -0.09724864 -0.060338684
    4 -0.04783333  0.006539326 
    5 -0.01459465 -0.067916667
    6 -0.01868126 -0.034409584

What I need is code that, for every subject (i.e. every row), samples either 3 or 4 values, averages them, and adds the result to a new vector (called half1). The vector half2 should hold the average of the values that were not sampled the first time.

So, supposing the data.frame I want to create is called "RTshalves", I would need the first column to be the same column of subjects as in RTsdiff. The second column must have, in its first row, the average of the randomly selected values for the first subject, and the third column must have the average of the first subject's values that were not chosen in the first sampling. The second row of columns 2 and 3 should hold the same information for subject 2 (that is, subject 40004 in my data.frame), and so on for all 219 subjects.

Let's suppose that the first sample randomly selected 3 values of subject 1 (block3diff, block5diff and block9diff) and thus the values of block4diff, block6diff, block7diff and block8diff would automatically correspond to the other half. Then, what I would expect to see (considering only the first of the 219 rows) is:

   Subject     Half1       Half2 
    40002   0.02908531   0.02579269
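
As a quick check of the expected row above, the arithmetic for subject 40002 can be reproduced from the values printed by head(RTsdiff). This is only a sketch: the vector names half1_vals and half2_vals are made up for illustration, and the numbers are copied by hand from the first row.

```r
# Values for subject 40002, copied from the printed first row of RTsdiff
half1_vals <- c(0.076961798, -0.027012048, 0.037306173)              # block3, block5, block9
half2_vals <- c(0.046067460, 0.017920261, 0.002660317, 0.03652273)   # block4, block6, block7, block8

mean(half1_vals)  # 0.02908531
mean(half2_vals)  # 0.02579269
```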

If anyone is interested in the statistics behind this, I'm trying to do a split-half reliability test to check for the consistency of a test. The rationale is that if the difference in RT average is a reliable estimator of the effect, then the differences of half of the blocks of one participant should be correlated to the differences of the other half of the blocks.

Help is much appreciated. Thanks in advance.

Ari B. Friedman · Accepted Answer · 2012-06-11T13:18:11.513

half1 is easy: write your own function to do what you want to each row (taken in as a vector), then apply it to the rows:

eachrow <- function(x) {
   mean(sample(x,2))
}
RTsdiff$half1 <- apply(RTsdiff, 1, eachrow) # apply() takes the data first, then the margin (1 = rows), then the function

To get half2, you'll probably want to do it at the same time. ddply might be easiest for this (let the by argument be your subject variable to get each row). Like this:

RTsdiff <- data.frame(subject=seq(6))
RTsdiff <- cbind( RTsdiff, matrix(runif(6*8),ncol=8) )

library(plyr)
eachrow <- function(x,n=3) {
  x <- as.numeric(x[,2:ncol(x)]) # eliminate the ID column to make things easier, make a vector
  s <- seq(length(x))
  ones <- sample(s,n) # get ids for half1
  twos <- !(s %in% ones) # get ids for half2
  data.frame( half1=mean(x[ones]), half2=mean(x[twos]) )
}
ddply( RTsdiff, .(subject), eachrow)

  subject     half1     half2
1       1 0.4700982 0.5350610
2       2 0.6173469 0.5351995
3       3 0.2245246 0.6807482
4       4 0.6330649 0.6316353
5       5 0.6388060 0.6629077
6       6 0.4652086 0.5073034

There are plenty of more elegant ways of doing this. In particular, I used ddply for its ability to easily output data.frames so that I could output both half1 and half2 from the function and have them combined up nicely at the end, but ddply takes data.frames as input, so there's some slight machination to get it out to a vector first. Feeding sapply a transposed data.frame would possibly be simpler.
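
One possible base-R version of that simpler route is sketched below. It is not the code from the answer: the function name split_half, the made-up example data and the fixed seed are all assumptions for illustration. It also picks the size of the first half at random with sample(3:4, 1), which is the fix worked out in the comments.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is repeatable

# Stand-in data shaped like RTsdiff: an ID column plus 7 difference columns
RTsdiff <- data.frame(subject = seq(6), matrix(runif(6 * 7), ncol = 7))

# Split one row's values into two random halves (3 vs 4, or 4 vs 3)
# and return the mean of each half
split_half <- function(x) {
  n <- sample(3:4, 1)               # size of half1, chosen at random per row
  ones <- sample(seq_along(x), n)   # positions that go into half1
  c(half1 = mean(x[ones]), half2 = mean(x[-ones]))
}

# Drop the ID column, apply row by row, and transpose so each subject is a row
halves <- t(apply(as.matrix(RTsdiff[, -1]), 1, split_half))
RTshalves <- data.frame(subject = RTsdiff$subject, halves)

# Correlation between the two halves across subjects
cor(RTshalves$half1, RTshalves$half2)
```

Because the split is random, repeating the last three statements without the seed gives a different correlation each time.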

edited Jun 11 '12 at 13:18

answered Jun 11 '12 at 13:00


I think to meet the OP's desire, your `eachrow` function should exclude the first column. Maybe `mean(sample(x[-1],2))` would suffice if I'm right. – Carl Witthoft Jun 11 '12 at 13:48
You mean exclude `block3diff`? That wasn't my reading but if so as you point out it's very easy to accomplish. – Ari B. Friedman Jun 11 '12 at 13:59
Thanks for your answer gsk3, but I'm a bit confused about it. First, why are you defining the function eachrow() twice? (in the first chunk of code of your answer and on the second). Second, why are you re-defining RTsdiff, also twice, and overriding the data I have. Are you doing that to have an example data.frame? Third, if the argument n=3, then the numbers that get sampled in "ones" will always be 3, and the objects sampled to "twos" will always be four. I need the sampling to be random (i.e. sometimes 3 to "ones" sometimes 4 to "ones"). Many thanks for your help. – HernanLG Jun 11 '12 at 16:00
1) The first was a simpler example to illustrate the principle of how the `apply` family functions work. It has little to do with the second. 2) You did not provide a sample dataset, so I created one. In the future, please use `dput` so that it's easy to re-create your example. 3) Easy enough to change. Just add a line within the `eachrow` function that samples: `n <- sample(c(3,4))` – Ari B. Friedman Jun 11 '12 at 16:30
Thanks for your clarifications gsk3. I had actually tried n=sample(c(3:4)), but I tried to specify it in the line where I define the arguments of the eachrow function (that gave me an error). Following your advice, I have now removed the value 3 for n from the argument definition (so it just looks like eachrow(x)) and then I define n as sample(c(3:4)). The results I'm getting out of the correlation are very different from what I expected, so that makes me doubt I'm doing it right, but if the code does what I described in the question, then I guess I have very peculiar results :) – HernanLG Jun 11 '12 at 16:53
That was my intent (take it out of the arguments). Always a bummer when stuff doesn't work out. I'd double-check to make sure my code does what you want it to though. Sometimes difficult to match people's descriptions of their problem and desired result to what they actually mean--code is a much clearer way of describing intent sometimes! – Ari B. Friedman Jun 11 '12 at 17:51
@gsk3 No, I just thought the setup had "subject" as the first column in `RTSdiff` . Sorry for the confusion. – Carl Witthoft Jun 11 '12 at 18:50
I would like to point out that when I changed the n argument and defined it as an object, I wrote n <- sample(c(3:4)). This was a mistake, because n was sampling 3 AND 4 over and over again. That is why my correlation was so high. I changed it to n <- sample(c(3:4),1). Now my values make sense. Thanks a lot for all the help :) – HernanLG Jun 11 '12 at 20:49
@Hernan_L Ah, good point. I mis-remembered the default behavior of `sample` when given only one argument. Nice catch. – Ari B. Friedman Jun 12 '12 at 03:03
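
The difference discussed in the two comments above can be seen in a short sketch. The variable names both and one are made up for illustration, and the seed is only there to make the run repeatable.

```r
set.seed(42)  # assumption: seed chosen only for repeatability

both <- sample(c(3, 4))     # no size given: a random permutation of BOTH values
one  <- sample(c(3, 4), 1)  # size = 1: a single value, either 3 or 4

length(both)  # 2
length(one)   # 1
```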
I also found something else. When defining "twos" inside the eachrow function, the way you put it gives a vector of logical values. I was surprised that I got numbers in the final output at all! I changed it from twos <- !(s %in% ones) to twos <- s[!(s %in% ones)]. I also realized that when redefining the data.frame as x in the "eachrow" function, the as.numeric function is incompatible with the 2 dimensions of the original data.frame RTsdiff. Finally, I wonder if the data.frame output from each row is able to index correctly, given what I just mentioned. – HernanLG Jun 12 '12 at 05:41
@Hernan_L: You can index a vector by numerical indices or logical vectors. Your `x[s[!(s %in% ones)]]` is equivalent to my `x[!(s %in% ones)]`. The `as.numeric` line should work as intended (I tested it) as long as the data.frame is a single row. That will only not be the case if the `subject` column is not unique for every row (which was the assumption I built the case around). – Ari B. Friedman Jun 12 '12 at 11:10
Thanks gsk. You're right about the as.numeric for x, I misinterpreted it while trying to reproduce it. Indeed there is one subject per row. Results are fine now :) Thanks a lot – HernanLG Jun 12 '12 at 14:31
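
The equivalence of logical and numeric indexing mentioned in the last few comments can be checked with a small sketch. The values in x and the positions in ones are hypothetical and chosen only for illustration.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)  # hypothetical values for one subject
s <- seq_along(x)
ones <- c(1, 3, 7)                  # hypothetical positions picked for half1

twos_logical <- !(s %in% ones)      # TRUE where the position was not picked
twos_numeric <- s[twos_logical]     # the same positions as numbers: 2 4 5 6

identical(x[twos_logical], x[twos_numeric])  # TRUE
mean(x[twos_logical])                        # 42.5
```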
