1

I need to aggregate (group by) a data frame by two categorical columns, but one column has some missing values (by design). But, the aggregate function removes the rows with missing values. Is there another way to keep ALL combinations of the two categorical columns?

Here is a reproducable example:

aggregating the CO2 data:

aggregate(. ~Plant+Type, data=CO2, sum, na.rm=TRUE) 

adding a row with a missing "Type":

new = data.frame(Plant="Test",Type = "",Treatment=7, conc = 200, uptake = 200,stringsAsFactors = FALSE)
CO2 = rbind(CO2,new)

Aggregating the new data doesn't change the output:

aggregate(. ~Plant+Type, data=CO2, sum, na.rm=TRUE)

What I was expecting to get is a new row with a Plant "Test":

   Plant        Type Treatment conc uptake
1    Qn1      Quebec         7 2950  216.6
2    Qn2      Quebec         7 3045  246.1
3    Qn3      Quebec         7 3045  263.3
4    Qc1      Quebec        14 3045  209.8
5    Qc3      Quebec        14 3045  228.1
6    Qc2      Quebec        14 3045  228.9
7    Mn3 Mississippi         7 3045  168.8
8    Mn2 Mississippi         7 3045  191.4
9    Mn1 Mississippi         7 3045  184.8
10   Mc2 Mississippi        14 3045   85.0
11   Mc3 Mississippi        14 3045  121.1
12   Mc1 Mississippi        14 3045  126.0
13   Test                    7  200  200

1 Answers1

0

Based on an answer at https://stackoverflow.com/a/8212864

One way is to use ddply from the plyr library:

library(plyr)
groupColumns = c("Plant","Type")
dataColumns = c("conc","uptake")

res = ddply(CO2, groupColumns, function(x) colSums(x[dataColumns]))
res

   Plant        Type conc uptake
1    Qn1      Quebec 3045  232.6
2    Qn2      Quebec 3045  246.1
3    Qn3      Quebec 3045  263.3
4    Qc1      Quebec 3045  209.8
5    Qc3      Quebec 3045  228.1
6    Qc2      Quebec 3045  228.9
7    Mn3 Mississippi 3045  168.8
8    Mn2 Mississippi 3045  191.4
9    Mn1 Mississippi 3045  184.8
10   Mc2 Mississippi 3045   85.0
11   Mc3 Mississippi 3045  121.1
12   Mc1 Mississippi 3045  126.0

new = data.frame(Plant="Test",Type = "",Treatment=7, conc = 200, uptake = 200,stringsAsFactors = FALSE)
CO2 = rbind(CO2,new)
res = ddply(CO2, groupColumns, function(x) colSums(x[dataColumns]))
res
   Plant        Type conc uptake
1    Qn1      Quebec 3045  232.6
2    Qn2      Quebec 3045  246.1
3    Qn3      Quebec 3045  263.3
4    Qc1      Quebec 3045  209.8
5    Qc3      Quebec 3045  228.1
6    Qc2      Quebec 3045  228.9
7    Mn3 Mississippi 3045  168.8
8    Mn2 Mississippi 3045  191.4
9    Mn1 Mississippi 3045  184.8
10   Mc2 Mississippi 3045   85.0
11   Mc3 Mississippi 3045  121.1
12   Mc1 Mississippi 3045  126.0
13  Test              200  200.0
Ilik
  • 165
  • 10