How to get all the sum in aggregate function?

Question

Here's some sample data:

dat="x1 x2 x3 x4 x5
1   C  1 16 NA 16
2   A  1 16 16 NA
3   A  1 16 16 NA
4   A  4 64 64 NA
5   C  4 64 NA 64
6   A  1 16 16 NA
7   A  1 16 16 NA
8   A  1 16 16 NA
9   B  4 64 32 32
10  A  3 48 48 NA
11  B  4 64 32 32
12  B  3 48 32 16"

data<-read.table(text=dat,header=TRUE)   
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)   
  x1 x2  x3 x4 x5
1  B 11 176 96 80

How do I get the sums for groups A and C in x1 as well? Setting na.action explicitly gives the same result:

 aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)  
   x1 x2  x3 x4 x5
 1  B 11 176 96 80

When I use sqldf:

library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
  sum(x2) sum(x3) sum(x4) sum(x5)
1      12     192     192    <NA>
2      11     176      96      80
3       5      80      NA      80

Why do I get <NA> in the first row but NA in the third row? What is the difference between them? Where does the <NA> come from? There is no <NA> in data!

str(data)
'data.frame':   12 obs. of  5 variables:
 $ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
 $ x2: int  1 1 1 4 4 1 1 1 4 3 ...
 $ x3: int  16 16 16 64 64 16 16 16 64 48 ...
 $ x4: int  NA 16 16 64 NA 16 16 16 32 48 ...
 $ x5: int  16 NA NA NA 64 NA NA NA 32 NA ...

The sqldf problem remains: why does sum(x4) give NA while sum(x5) gives <NA>?

I can show that the NAs in x4 and x5 are the same kind of value, because they can all be replaced in one step:

data[is.na(data)] <- 0     

> data
   x1 x2 x3 x4 x5
1   C  1 16  0 16
2   A  1 16 16  0
3   A  1 16 16  0
4   A  4 64 64  0
5   C  4 64  0 64
6   A  1 16 16  0
7   A  1 16 16  0
8   A  1 16 16  0
9   B  4 64 32 32
10  A  3 48 48  0
11  B  4 64 32 32
12  B  3 48 32 16

So it is very strange that sqldf treats sum(x4) and sum(x5) differently, and I suspect a logic error in sqldf. The behavior can be reproduced on other machines. Please try it yourself before continuing the discussion.

Maybe this helps: http://stackoverflow.com/questions/8859124/na-values-using-sqldf — lukeA, Dec 30 '13 at 12:07
You get `<NA>` to distinguish a real `NA` value from the character representation of `NA`, e.g. `"NA"`. If you look at the return value from running that command, you get a `data.frame` in which the first three columns are of type `integer` and the fourth is of type `character`. I guess `sqldf` is converting the fourth to a factor somewhere along the way. Try `str( sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1") )` to see what I mean. — Simon O'Hanlon, Dec 30 '13 at 13:51
SQLite assigns the column affinity according to the first row of the column and uses text if it is NULL. Some workarounds are: (1) use the same name for the output column as the input, in which case sqldf will deduce that you wanted to coerce back to that type, (2) use `total` in place of `sum`, in which case zero rows will total to 0 rather than NULL so the problem does not occur, (3) use sqldf's `method` arg to specify the classes, (4) use one of the other databases that sqldf supports (H2, MySQL, PostgreSQL) instead of SQLite. See `?sqldf` and http://sqldf.googlecode.com for more info. — G. Grothendieck, Jan 03 '14 at 17:22
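
Workaround (2) from the comment above can be sketched roughly as follows. This is only an illustration, assuming the sqldf package is installed with its default SQLite backend; the output column names t2 to t5 are made up here so that sqldf's name-based coercion (workaround 1) does not come into play.

```r
library(sqldf)

# Same sample data as in the question, rebuilt so the snippet runs on its own.
data <- data.frame(
  x1 = c("C", "A", "A", "A", "C", "A", "A", "A", "B", "A", "B", "B"),
  x2 = c(1, 1, 1, 4, 4, 1, 1, 1, 4, 3, 4, 3),
  x3 = c(16, 16, 16, 64, 64, 16, 16, 16, 64, 48, 64, 48),
  x4 = c(NA, 16, 16, 64, NA, 16, 16, 16, 32, 48, 32, 32),
  x5 = c(16, NA, NA, NA, 64, NA, NA, NA, 32, NA, 32, 16)
)

# SQLite's total() returns a floating-point 0 instead of NULL when a group
# has no non-NULL values, so no result column starts with a NULL.
res <- sqldf("select x1,
                     total(x2) as t2, total(x3) as t3,
                     total(x4) as t4, total(x5) as t5
              from data group by x1")

str(res)  # inspect the column classes
```

Note that total() always returns a floating-point value, so the sums come back as numeric rather than integer.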

score 6 · Answer 1 · answered Dec 30 '13 at 12:38

Here's the data.table way in case you're interested:

require(data.table)
dt <- data.table(data)
dt[, lapply(.SD, sum, na.rm=TRUE), by=x1]
#    x1 x2  x3  x4 x5
# 1:  C  5  80   0 80
# 2:  A 12 192 192  0
# 3:  B 11 176  96 80

If you want sum to return NA instead of the sum after removing NA's, just remove the na.rm=TRUE argument.
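
The effect of na.rm on sum can be checked in isolation with plain vectors. This is a small base R sketch; the vector names are made up for illustration.

```r
# One missing value among real ones, as in group C's x4/x5 pattern.
vals <- c(16, NA, 64)

sum(vals)                # NA: a single NA makes the whole sum NA
sum(vals, na.rm = TRUE)  # 80: the NA is dropped before summing

# A group where every value is missing, as in x5 for group A.
all_missing <- c(NA_integer_, NA_integer_)

sum(all_missing, na.rm = TRUE)  # 0: the sum of nothing is zero
```

That last case is why the all-NA groups show up as 0 rather than NA in the output above.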

.SD here is an internal data.table variable that constructs, by default, all the columns not in by - here all except x1. You can check the contents of .SD by doing:

dt[, print(.SD), by=x1]

to get an idea of what .SD contains. If you're interested, check ?data.table for other internal (and very useful) special variables like .I, .N, .GRP, etc.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-01-01T16:56:45.643

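By default, the formula method for aggregate uses na.action = na.omit, which drops every row containing an NA before sum is ever called. In this data every A and C row has an NA in x4 or x5, so only B survives. You need to override that default before the na.rm argument of sum can have any effect. You can do this by setting na.action to NULL or na.pass: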

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = NULL)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = na.pass)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80
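
To see which rows the default handling throws away, you can apply na.omit to the data yourself. This is a base R sketch that rebuilds the question's sample data so it runs on its own; the names kept and res are made up for illustration.

```r
# Same sample data as in the question.
data <- data.frame(
  x1 = c("C", "A", "A", "A", "C", "A", "A", "A", "B", "A", "B", "B"),
  x2 = c(1, 1, 1, 4, 4, 1, 1, 1, 4, 3, 4, 3),
  x3 = c(16, 16, 16, 64, 64, 16, 16, 16, 64, 48, 64, 48),
  x4 = c(NA, 16, 16, 64, NA, 16, 16, 16, 32, 48, 32, 32),
  x5 = c(16, NA, NA, NA, 64, NA, NA, NA, 32, NA, 32, 16)
)

# na.omit drops every row with an NA in any column, which is what the
# formula method does by default before FUN is called.
kept <- na.omit(data)
unique(as.character(kept$x1))  # only "B" is left

# With na.pass all rows reach sum(), and na.rm = TRUE handles the NAs there.
res <- aggregate(cbind(x2, x3, x4, x5) ~ x1, data = data, FUN = sum,
                 na.rm = TRUE, na.action = na.pass)
res
```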

Regarding sqldf, it seems like the columns are being cast to different types depending on whether the item in the first row of the first grouping variable is an NA or not. If it is an NA, that column gets cast as character.

Compare:

df1 <- data.frame(id = c(1, 1, 2, 2, 2),
                 A = c(1, 1, NA, NA, NA),
                 B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df1 group by id")
#   sum(A) sum(B)
# 1      2   <NA>
# 2     NA    3.0

df2 <- data.frame(id = c(2, 2, 1, 1, 1),
                  A = c(1, 1, NA, NA, NA),
                  B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df2 group by id")
#   sum(A) sum(B)
# 1   <NA>      3
# 2    2.0     NA

However, there is an easy workaround: reassign the original name to the new columns being created. Perhaps that lets SQLite inherit some of the information from the previous database? (I don't really use SQL.)

Example (with the same "df2" created earlier):

sqldf("select sum(A) `A`, sum(B) `B` from df2 group by id")
#    A  B
# 1 NA  3
# 2  2 NA

You can easily use paste to create your select statement:

Aggs <- paste("sum(", names(data)[-1], ") `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192 NA
# 2 11 176  96 80
# 3  5  80  NA 80
str(.Last.value)
# 'data.frame':  3 obs. of  4 variables:
#  $ x2: int  12 11 5
#  $ x3: int  192 176 80
#  $ x4: int  192 96 NA
#  $ x5: int  NA 80 80

A similar approach can be taken if you want NA to be replaced with 0:

Aggs <- paste("sum(ifnull(", names(data)[-1], ", 0)) `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192  0
# 2 11 176  96 80
# 3  5  80   0 80

Dear Ananda Mahto, you still have not answered why sqldf treats the NAs differently. — showkey, Jan 01 '14 at 14:18
@it_is_a_literature, I don't have an exact answer *why*, but I have a solution. I'm guessing this isn't as much a problem from the side of "sqldf", but more of something to do with SQLite. See [FAQ 14 at this page](https://code.google.com/p/sqldf/#14._How_does_one_read_files_where_numeric_NAs_are_represented_as) for a similar problem while reading data in. I'm presuming something similar is happening here. — A5C1D2H2I1M1N2O1R2T1, Jan 01 '14 at 17:01

Davide Passaretti · Answer 3 · 2013-12-30T12:29:52.000

2

aggregate(data[, -1], by=list(data$x1), FUN=sum)

I dropped the first column because it is not part of the sum; it is only the grouping variable used to split the data (which is why it appears in "by").
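
As written, this call returns NA for any group that contains a missing value. A possible variant, sketched here in base R with the question's sample data rebuilt so it runs on its own, passes na.rm = TRUE through to sum; the names res_na and res are made up for illustration.

```r
# Same sample data as in the question.
data <- data.frame(
  x1 = c("C", "A", "A", "A", "C", "A", "A", "A", "B", "A", "B", "B"),
  x2 = c(1, 1, 1, 4, 4, 1, 1, 1, 4, 3, 4, 3),
  x3 = c(16, 16, 16, 64, 64, 16, 16, 16, 64, 48, 64, 48),
  x4 = c(NA, 16, 16, 64, NA, 16, 16, 16, 32, 48, 32, 32),
  x5 = c(16, NA, NA, NA, 64, NA, NA, NA, 32, NA, 32, 16)
)

# Without na.rm, groups containing an NA sum to NA.
res_na <- aggregate(data[, -1], by = list(x1 = data$x1), FUN = sum)

# Extra arguments to aggregate() are forwarded to FUN, so na.rm reaches sum().
res <- aggregate(data[, -1], by = list(x1 = data$x1), FUN = sum, na.rm = TRUE)
res
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80
```

Naming the list element (x1 = data$x1) keeps the grouping column from being labelled Group.1 in the result.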

edited Dec 30 '13 at 12:29

answered Dec 30 '13 at 12:01


what is `data[,-1]` there? – janos Dec 30 '13 at 12:05
I eliminated the first column because you don't have to consider it in the sum, it is just a group variable you have to use to split the data (as a matter of fact I used it in "by") – Davide Passaretti Dec 30 '13 at 12:10
Maybe you could add this explanation in your answer to make it more complete and useful ;-) – janos Dec 30 '13 at 12:23
You may also want `na.rm=TRUE`? (It depends on the OP's real data. In this example *all* values in a group are `NA` **or** all are not `NA`.) – Simon O'Hanlon Dec 30 '13 at 13:44

score 2 · Answer 4 · answered Jan 01 '14 at 16:14

Here's how you would do this with the reshape package:

> library(reshape)
> # x1 = identifier variable, everything else = measured variables
> data_melted <- melt(data, id="x1", measure.vars=c("x2", "x3", "x4", "x5"))
>
> # Thus we now have (measured variable and its value) per x1 (id variable)
> head(data_melted)
  x1 variable value
1  C       x2     1
2  A       x2     1
3  A       x2     1
4  A       x2     4
5  C       x2     4
6  A       x2     1

> tail(data_melted)
   x1 variable value
43  A       x5    NA
44  A       x5    NA
45  B       x5    32
46  A       x5    NA
47  B       x5    32
48  B       x5    16

> # Now aggregate using sum, passing na.rm to it
> cast(data_melted, x1 ~ ..., sum, na.rm=TRUE)
  x1 x2  x3  x4 x5
1  A 12 192 192  0
2  B 11 176  96 80
3  C  5  80   0 80

Alternatively, you could drop the NAs during the melt() step itself by passing na.rm=TRUE there.

The great thing about learning library(reshape) is, to quote the author ("Reshaping Data with the reshape Package"):

"In R, there are a number of general functions that can aggregate data, for example tapply, by and aggregate, and a function specifically for reshaping data, reshape. Each of these functions tends to deal well with one or two specific scenarios, and each requires slightly different input arguments. In practice, you need careful thought to piece together the correct sequence of operations to get your data into the form that you want. The reshape package grew out of my frustrations with reshaping data for consulting clients, and overcomes these problems with a general conceptual framework that uses just two functions: melt and cast."
