
I have a data.frame with 8 columns. One is for the list of subjects (one row per subject) and the other 7 columns each hold a score of either 1 or 0. This is what the data looks like:

>head(splitkscores)
  subject block3 block4 block5 block6 block7 block8 block9
1   40002      0      0      1      0      0      0      0
2   40002      0      0      1      0      0      1      1
3   40002      1      1      1      1      1      1      1
4   40002      1      1      0      0      0      1      0
5   40002      0      1      0      0      0      1      1
6   40002      0      1      1      0      1      1      1

I want to create a data.frame with 3 columns. One column for subjects. In the other two columns, one must have the sum of 3 or 4 randomly chosen numbers from each row of my data.frame (except the subject) and the other column must have the sum of the remaining values which were not chosen in the first random sample.

Help is much appreciated. Thanks in advance

HernanLG

2 Answers


I think this'll do it: [changed the way data were read in based on the other response because I made a manual mistake...]

   splitkscores <- read.table(text = "  subject block3 block4 block5 block6 block7 block8 block9
1   40002      0      0      1      0      0      0      0
2   40002      0      0      1      0      0      1      1
3   40002      1      1      1      1      1      1      1
4   40002      1      1      0      0      0      1      0
5   40002      0      1      0      0      0      1      1
6   40002      0      1      1      0      1      1      1", header = TRUE)

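   # Result frame: one row per input row; the two sums are filled in below
   df2 <- data.frame(subject = splitkscores$subject, sum3or4 = NA, leftover = NA)
   # For each row, draw 3 or 4 of its scores at random (a fresh draw per row) and sum them
   df2$sum3or4 <- apply(splitkscores[,2:ncol(splitkscores)], 1, function(x){
       sum(sample(x, sample(c(3,4),1), replace = FALSE))
     })
   # The scores that were not drawn sum to the row total minus the sampled sum
   df2$leftover <- rowSums(splitkscores[,2:ncol(splitkscores)]) - df2$sum3or4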

   df2
     subject sum3or4 leftover
   1   40002       1        0
   2   40002       2        1
   3   40002       3        4
   4   40002       1        2
   5   40002       2        1
   6   40002       1        4
tim riffe
  • Many thanks for the quick reply. I should clarify I made a mistake when presenting my original data.frame. The column for subjects had been wrongly specified, each row of that column is a different subject, so I didn't use the first part of your code. The second part of your code, I modified it slightly to include columns from 2 to 8, instead of 2:7, which was leaving one column out. The data.frame produced is indeed what I was looking for. Many thanks for your help and your elegant code. – HernanLG Jun 08 '12 at 23:26
  • I'm not 100% sure it is. Do you want the sample of columns to differ row by row? I imagine not (based on my current imaginings of why you might want to do this sort of operation). – Tim P Jun 08 '12 at 23:41
  • For instance, sum3or4 is greater for row 5 than row 6, which would never happen if the subset of columns was selected once. A separate sample for each row doesn't feel like the right thing to do (if you're trying to do some bootstrappy type stuff for instance). – Tim P Jun 08 '12 at 23:44
  • yeah, I changed the 7 to `ncol(splitkscores)` to make it more flexible. The 7 was due to my original sloppy manual creation of `splitkscores`. glad it was useful – tim riffe Jun 08 '12 at 23:54
  • I'm doing this to do a split-half correlation of a scoring method. The 1s and 0s were scores assigned to each subject based on a formula. What I was doing now was calculating the score for each subject for two halves of the data and see if they were correlated, which would show internal consistency. The r was unexpectedly low, which makes me think the code I originally used from this example was not quite right (my fault, obviously). I used the code provided by Tim P in the other answer and I think it worked. Please see my comment in Tim P's answer for my final solution. Many thanks – HernanLG Jun 09 '12 at 00:28
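
The split-half check described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `scores` data frame is simulated (it is not the asker's data), the seed and the number of subjects are arbitrary, and the names `half1`, `half2` and `r_half` are made up for this example. It uses one random split of the block columns for all subjects, then correlates the two half scores with `cor()`.

```r
set.seed(42)  # assumption: arbitrary seed, only to make the sketch reproducible

# Hypothetical data: 50 subjects with 7 binary block scores each
n_subjects <- 50
scores <- as.data.frame(matrix(rbinom(n_subjects * 7, 1, 0.5), ncol = 7))
colnames(scores) <- paste0("block", 3:9)
scores <- cbind(subject = seq_len(n_subjects), scores)

# One random split of the block columns, applied to every subject
blocks <- setdiff(colnames(scores), "subject")
half1  <- sample(blocks, sample(c(3, 4), 1))
half2  <- setdiff(blocks, half1)

# Per-subject score on each half
sum1 <- rowSums(scores[, half1, drop = FALSE])
sum2 <- rowSums(scores[, half2, drop = FALSE])

# Pearson correlation between the two half scores
r_half <- cor(sum1, sum2)
```

Because the simulated scores are independent coin flips, `r_half` should come out close to zero here; with real scores that share a common signal it would be expected to be higher.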

Here's a neat and tidy solution free of unnecessary complexity (assume the input is called df):

chosen=sort(sample(setdiff(colnames(df),"subject"),sample(c(3,4),1)))
notchosen=setdiff(colnames(df),c("subject",chosen))
out=data.frame(subject=df$subject,
               sum1=apply(df[,chosen],1,sum),sum2=apply(df[,notchosen],1,sum))

In plain English: sample from the column names other than "subject", choosing a sample size of either 3 or 4, and call those column names chosen; define notchosen to be the other columns (excluding "subject" again, obviously); then return a data frame with the list of subjects, the sum of the chosen columns, and the sum of the non-chosen columns. Done.
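
As a rough, self-contained sketch of the steps just described, here is the same idea run on the six example rows from the question. The `set.seed()` call and the `score_cols` helper are additions for illustration only, and `rowSums()` is swapped in for `apply(..., 1, sum)`; it should give the same sums, and `drop = FALSE` keeps the subset a data frame.

```r
# Example rows copied from the question (first field is the row name)
df <- read.table(text = "  subject block3 block4 block5 block6 block7 block8 block9
1   40002      0      0      1      0      0      0      0
2   40002      0      0      1      0      0      1      1
3   40002      1      1      1      1      1      1      1
4   40002      1      1      0      0      0      1      0
5   40002      0      1      0      0      0      1      1
6   40002      0      1      1      0      1      1      1", header = TRUE)

set.seed(1)  # assumption: fixed seed, only so the split can be reproduced

# All score columns, i.e. everything except "subject"
score_cols <- setdiff(colnames(df), "subject")

# Pick 3 or 4 of them once; the same columns are used for every row
chosen    <- sort(sample(score_cols, sample(c(3, 4), 1)))
notchosen <- setdiff(score_cols, chosen)

out <- data.frame(subject = df$subject,
                  sum1 = rowSums(df[, chosen, drop = FALSE]),
                  sum2 = rowSums(df[, notchosen, drop = FALSE]))
```

Whichever columns are drawn, `out$sum1 + out$sum2` should equal the row total of the seven score columns, which is a quick sanity check on the split.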

Tim P
  • I used this: `chosen=sort(sample(setdiff(colnames(splitkscores),"subject"),sample(c(3,4),1)))`, `notchosen=setdiff(colnames(splitkscores),c("subject",chosen))`, `out=data.frame(subject=splitkscores$subject, sum1=apply(splitkscores[,chosen],1,sum),sum2=apply(splitkscores[,notchosen],1,sum))`. Apparently it worked (at least my correlation makes more sense now). Many thanks! – HernanLG Jun 09 '12 at 00:29
  • as far as I can tell, the difference here is that my code randomly chooses 3 or 4 columns for each row, whereas this code consistently uses either 3 or 4 columns, and they are the same columns for the entire operation. – tim riffe Jun 09 '12 at 00:34
  • Yes, since using a consistent set of columns makes much more mathematical and statistical sense. The code's also much cleaner imho, but then these things are a little bit subjective :) – Tim P Jun 09 '12 at 06:02
  • No problem Hernan, glad to be of assistance :) – Tim P Jun 09 '12 at 06:03