Insert NA's in case there are no observations when using subset() and then dcast or tapply

Question

I have the following data frame (this is only the head of the data frame). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition" only for observations that have z.score (in absolute values) smaller than 1.

> b
  subject   rt ac condition     z.score
1      99 1253  1     200_9  1.20862682
2      99 1895  1     102_2  2.95813507
3      99 1049  1      68_1  1.16862102
4      99 1732  1      68_9  2.94415384
5      99  765  1      34_9 -0.63991180
7      99 1016  1      68_2 -0.03191493

I know I can do this with tapply or dcast (from reshape2) after subsetting the data:

b1 <- subset(b, abs(z.score) < 1)

b2 <- dcast(b1, subject~condition, mean, value.var = "rt")

  subject      34_1      34_2      34_9      68_1      68_2     68_9     102_1     102_2    102_9     200_1     200_2    200_9
1      99 1028.5714  957.5385  861.6818  837.0000  969.7222 856.4000  912.5556  977.7273 858.7800 1006.0000 1015.3684 913.2449
2    5203  957.8889  815.2500  845.7750  933.0000  893.0000 883.0435  926.0000  879.2778 813.7308  804.2857  803.8125 843.7200
3    5205 1456.3333 1008.4286  850.7170 1142.4444  910.4706 998.4667  935.2500  980.9167 897.4681 1040.8000  838.7917 819.9710
4    5306 1022.2000  940.5882  904.6562 1525.0000 1216.0000 929.5167  955.8571  981.7500 902.8913  997.6000  924.6818 883.4583
5    5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6    5308  659.8571  904.2857  966.7755  960.9091 1048.6000 904.5082  836.2000 1753.6667 926.0400  870.2222 1066.6667 930.7500

In the example above, every subject had observations in b1 that met the subset criterion. However, a subject may have no observations left after subsetting. In that case I want b2 to contain NA for that subject in each condition where no observations meet the criterion. Does anyone have an idea how to do that? Any help will be greatly appreciated.

Best, Ayala

Also try to use `dput` so that we can see the exact structure of your code. — JasonAizkalns, Nov 26 '14 at 13:58
How can I format my post so you can see it better? Thanks, Ayala — ayalaall, Nov 26 '14 at 14:11
@AyalaAllon, Please have a look here for how to format: http://stackoverflow.com/help/formatting — Henrik, Nov 26 '14 at 14:28
Your problem is not clearly stated. You have some valid `data.frame` `b1` , so can you show us which columns are the ones that "won't have observations" and what sort of value indicates there's no observation present? For a simple example, if it were the case that `b1$ac !=1` means no observation, then you could do `b1$subject[which(b1$ac!=1)] <- NA` . — Carl Witthoft, Nov 26 '14 at 15:02
I'll rephrase my question. My variable b contains long data, and I want to convert it to wide data (b2) after calling subset(). In subset() I keep all rows of b whose "z.score" is smaller than 1 in absolute value; this gives b1. However, for some subjects all observations have an absolute z.score above 1. In that case b1 does not contain that subject, and when I then use dcast() I end up with a b2 that doesn't contain all subjects. I don’t want that. I want b2 to have NAs for the conditions in which a subject had no observations matching the subset. — ayalaall, Nov 26 '14 at 16:33
If `subject` is a factor instead of an integer, you can use the `drop` argument in `dcast` to keep missing combinations in output dataset. By default these are dropped, so use `drop = FALSE`. You'll get `NaN` filling all missing values (see `fill` argument). If you really want `NA` instead you could use something like `fill = as.numeric(NA)`. — aosmith, Nov 26 '14 at 16:34

score 2 · Accepted Answer · answered Nov 26 '14 at 16:48

There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.

Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.

library(reshape2)

bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2), 
             ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
             z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))

If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.

dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean, 
     value.var = "rt", drop = FALSE)

  subject   A B  C D
1      99 125 2 10 4

Make subject a factor.

bb$subject = factor(bb$subject)
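
The factor conversion matters because `subset()` keeps a factor's full set of levels even when every row for a level is filtered out, whereas a numeric column simply loses the value. A minimal base-R sketch of that difference, reusing the toy `bb` data from above (`num_kept` and `fac_kept` are illustrative names only):

```r
# Same toy data as in the answer; subject 11 has no rows with |z.score| < 1.
bb <- data.frame(subject = c(99, 99, 99, 99, 99, 11, 11, 11),
                 rt = c(100, 150, 2, 4, 10, 15, 1, 2),
                 ac = rep(1, 8),
                 condition = c("A", "A", "B", "D", "C", "C", "D", "D"),
                 z.score = c(0.2, 0.3, 0.2, 0.3, 0.2, 2, 2, 2))

# Numeric subject: after subsetting, 11 is gone without a trace.
num_kept <- unique(subset(bb, abs(z.score) < 1)$subject)

# Factor subject: the rows for 11 are gone, but the level "11" survives,
# which is what lets drop = FALSE bring the subject back.
bb$subject <- factor(bb$subject)
fac_kept <- levels(subset(bb, abs(z.score) < 1)$subject)

print(num_kept)  # 99
print(fac_kept)  # "11" "99"
```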

Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.

dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean, 
     value.var = "rt", drop = FALSE) 

  subject   A   B   C   D
1      11 NaN NaN NaN NaN
2      99 125   2  10   4

To get NA instead of NaN you can use the fill argument.

dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean, 
     value.var = "rt", drop = FALSE, fill = as.numeric(NA)) 

  subject   A  B  C  D
1      11  NA NA NA NA
2      99 125  2 10  4
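
Since the question also mentions `tapply`, here is a base-R sketch of the same idea, again assuming the toy `bb` data. `tapply` builds one cell per combination of factor levels and leaves cells with no data as `NA`, so no `fill` argument is needed. The `NaN` seen with `dcast` comes from taking the mean of an empty vector. The names `kept` and `wide` are illustrative only.

```r
bb <- data.frame(subject = c(99, 99, 99, 99, 99, 11, 11, 11),
                 rt = c(100, 150, 2, 4, 10, 15, 1, 2),
                 ac = rep(1, 8),
                 condition = c("A", "A", "B", "D", "C", "C", "D", "D"),
                 z.score = c(0.2, 0.3, 0.2, 0.3, 0.2, 2, 2, 2))

# Both grouping variables must be factors so unused levels are remembered.
bb$subject <- factor(bb$subject)
bb$condition <- factor(bb$condition)

# subset() drops the rows for subject 11 but keeps the factor level.
kept <- subset(bb, abs(z.score) < 1)

# One cell per subject-by-condition combination; empty cells are NA.
wide <- with(kept, tapply(rt, list(subject = subject, condition = condition), mean))
print(wide)

# Why dcast showed NaN rather than NA for the empty cells:
print(mean(numeric(0)))  # NaN
```

Note that the result is a matrix with subjects as row names, not a data frame with a `subject` column.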

aosmith

Hi aosmith! Thank you so much for your answer and help. This is indeed what I need to do. Thank you again!!!! – ayalaall Dec 02 '14 at 08:54
Hi @aosmith. Suppose I have bb1, in which all "z.score" values are below 1.5: `bb1 = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8), condition=c(1,1,2,4,3,3,4,4), z.score=c(0.2,0.3,0.2,0.3,0.3,0.2,0.2,0.2))`. I want to use dcast with abs(z.score) > 1.5: `bb1$subject = factor(bb1$subject)`, then `dcast(subset(bb1, abs(z.score) > 1.5), subject ~ condition, fun = length, value.var = "rt", drop = FALSE)`. But I get an error: `Error in dim(ordered) <- ns : dims [product 1] do not match the length of object [0]`. – ayalaall Dec 02 '14 at 09:56
Hi again @aosmith: do you know how I can solve the error I get in my previous comment? Best, Ayala – ayalaall Dec 02 '14 at 09:59
@AyalaAllon It looks like `dcast` isn't working with a 0-row data.frame. I would probably solve this by simply aggregating the dataset first and then casting. For example, using package **dplyr** you could do something like `bb1 %>% group_by(subject, condition) %>% summarise(n = length(rt[abs(z.score) > 1.5])) %>% dcast(subject ~ condition, value.var = "n")`, which calculates 0 for group sizes, or `bb1 %>% group_by(subject, condition) %>% summarise(n = n()[abs(z.score) > 1.5]) %>% dcast(subject ~ condition, value.var = "n")` if you want all `NA`. – aosmith Dec 02 '14 at 16:07
Hi @aosmith. Thank you so much! I have one more small question regarding the use of dcast in this situation. In your latest comment you used dcast(subject ~ condition, value.var = "n"), which means that I'll get the value.var I want for each subject and condition. However, what if I want to do the same as you suggested, but instead of subject ~ condition I want to get the value.var for each subject (i.e., across conditions)? I still want to use dcast to do that, so I'll be able to combine the dcast with your excellent suggestion in your last comment. Any help will be greatly appreciated – ayalaall Dec 03 '14 at 16:38
@AyalaAllon I'm not entirely sure what you are trying to do (maybe sum across `value.var`?). I recommend asking a new question, which will make it much easier to include a reproducible example of what you'd like to do. You might also consider accepting this answer if it answered your original question. – aosmith Dec 03 '14 at 16:44
@aosmith. Thank you. I'll post a new question now. Best, Ayala – ayalaall Dec 03 '14 at 16:51
Hi. I'm doing exactly what @aosmith suggested: `test1 <- raw_data_rt %>% group_by(subject) %>% summarise(n = mean(rt[abs(z_score) < 1])) %>% dcast(subject ~ "t1rt", value.var = "n", fun = mean)`. The thing is that I now incorporate this within a function, so the name of the column to `group_by` is not always subject. One of my function arguments is `id`, which is the name of the column to `group_by`. However, if instead of subject I write `raw_data_rt[[id]]` I get `NA`s in the `t1rt` column. What can I do to solve this? Best, Ayala – ayalaall Jun 25 '15 at 11:56
@ayalaall You should put this as a new question with a reproducible example of your current problem. These functions you've been using usually need variable names from the given dataset to work properly so you'll need to figure out a way to pass the variable name to these within a function. – aosmith Jun 25 '15 at 14:27
Thank you. At the end I decided to change the names of the columns to fixed names using: `names(df)[names(df) == 'old.var.name'] <- 'new.var.name'` so I could work in the same way as I did before. Thanks! Ayala – ayalaall Jun 27 '15 at 12:16
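
The two follow-up issues in this thread (a subset that leaves zero rows, and a grouping column whose name is only known at run time) can both be sidestepped by aggregating before reshaping. The sketch below is a base-R alternative to the dplyr pipeline suggested in the comments, not that pipeline itself. `count_extreme()` and its arguments are hypothetical names, and it assumes the data has `condition` and `z.score` columns as in the `bb1` example.

```r
bb1 <- data.frame(subject = c(99, 99, 99, 99, 99, 11, 11, 11),
                  rt = c(100, 150, 2, 4, 10, 15, 1, 2),
                  ac = rep(1, 8),
                  condition = c(1, 1, 2, 4, 3, 3, 4, 4),
                  z.score = c(0.2, 0.3, 0.2, 0.3, 0.3, 0.2, 0.2, 0.2))

# Count rows with |z.score| above `cutoff` in every id-by-condition cell.
# `id` is the grouping column's name as a string, looked up with [[ ]].
count_extreme <- function(df, id, cutoff) {
  groups <- list(df[[id]], df[["condition"]])
  names(groups) <- c(id, "condition")
  # Summing a logical vector counts its TRUE values.
  tapply(abs(df[["z.score"]]) > cutoff, groups, sum)
}

counts <- count_extreme(bb1, "subject", 1.5)
print(counts)
```

Because no rows are filtered out beforehand, every subject keeps a row even when nothing exceeds the cutoff: combinations that occur in the data get a count of 0, and combinations that never occur are `NA`.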

sandro scodelller · Answer 2 · 2014-11-26T15:49:46.067

0

Is it the following you are after? I created a similar dataset "bb"

library("reshape2")  # provides dcast
library("plyr")      # needed for the . function below
bb <- data.frame(subject = c(99,99,99,99,99,11,11,11), rt = c(100,150,2,4,10,15,1,2), ac = rep(1,8),
                 condition = c("A","A","B","D","C","C","D","D"), z.score = c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))

bb
#  subject  rt ac condition z.score
#1      99 100  1         A     0.2
#2      99 150  1         A     0.3
#3      99   2  1         B     0.2
#4      99   4  1         D     0.3
#5      99  10  1         C     1.5
#6      11  15  1         C    -0.3
#7      11   1  1         D     0.8
#8      11   2  1         D     0.7

Then you call dcast with subset included:

cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))  
cc  
   subject   A   B   C   D 
#1      11 NaN NaN  15 1.5
#2      99 125   2 NaN 4.0

edited Nov 26 '14 at 15:49

answered Nov 26 '14 at 15:44

Hi sandro, thank you so much for your answer. If I understand correctly, your suggestion and the answer by aosmith (see above answer) are the same suggestion. This is exactly what I wanted. Thank you again! – ayalaall Dec 02 '14 at 08:58
Yes they are, but his is better because he also explains how to use the drop keyword :-) That is why I voted for his answer ;-) – sandro scodelller Dec 02 '14 at 09:05
