Here's some sample data:
dat="x1 x2 x3 x4 x5
1 C 1 16 NA 16
2 A 1 16 16 NA
3 A 1 16 16 NA
4 A 4 64 64 NA
5 C 4 64 NA 64
6 A 1 16 16 NA
7 A 1 16 16 NA
8 A 1 16 16 NA
9 B 4 64 32 32
10 A 3 48 48 NA
11 B 4 64 32 32
12 B 3 48 32 16"
data<-read.table(text=dat,header=TRUE)
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)
x1 x2 x3 x4 x5
1 B 11 176 96 8
How do I get the sum of A
and C
as well in x1
?
aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)
x1 x2 x3 x4 x5
1 B 11 176 96 80
When I use sqldf
:
library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
sum(x2) sum(x3) sum(x4) sum(x5)
1 12 192 192 <NA>
2 11 176 96 80
3 5 80 NA 80
Why do I get <NA>
in the first line, but NA
in the third line ?
What is the differences between them? Why do I get the <NA>
? there is no <NA>
in data!
str(data)
'data.frame': 12 obs. of 5 variables:
$ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
$ x2: int 1 1 1 4 4 1 1 1 4 3 ...
$ x3: int 16 16 16 64 64 16 16 16 64 48 ...
$ x4: int NA 16 16 64 NA 16 16 16 32 48 ...
$ x5: int 16 NA NA NA 64 NA NA NA 32 NA ...
The sqldf problem remains here, why sum(x4)
gets NA
, on the contrary sum(x5)
gets <NA>
?
I can prove that all NA
both in x4 and x5 is the same this way:
data[is.na(data)] <- 0
> data
x1 x2 x3 x4 x5
1 C 1 16 0 16
2 A 1 16 16 0
3 A 1 16 16 0
4 A 4 64 64 0
5 C 4 64 0 64
6 A 1 16 16 0
7 A 1 16 16 0
8 A 1 16 16 0
9 B 4 64 32 32
10 A 3 48 48 0
11 B 4 64 32 32
12 B 3 48 32 16
So the fact that sqldf treats sum(x4)
and sum(x5)
differently is so strange that I think there is a logical mess in sqldf. It can be reproduced in other pc. Please do first and then have the discussion go on.