R Data transform - Columns to Rows and aggregate

Question

I'm struggling with a data transformation in R. The data I receive is of this type:

input <- data.frame(AF = sample(0:1, 100, replace=TRUE),
                CAD = sample(0:1, 100, replace=TRUE),
                CHF = sample(0:1, 100, replace=TRUE),
                DEM = sample(0:1, 100, replace=TRUE),
                DIAB = sample(0:1, 100, replace=TRUE))
input$Counts <- rowSums(input)

And the output I'm trying to achieve is:

output <- data.frame(Condition = c('AF', 'CAD', 'CHF', 'DEM', 'DIAB'),
                 '1' = sample(11:20, 5, replace=TRUE),
                 '2' = sample(11:20, 5, replace=TRUE),
                 '3' = sample(11:20, 5, replace=TRUE),
                 '4' = sample(11:20, 5, replace=TRUE),
                 '5' = sample(11:20, 5, replace=TRUE))

Where the intersections are the counts of observations matching the condition (which are now in the first column) and the row sums (which are now separate columns).

My solution is below but I'm wondering if there is a more elegant solution?

data.frame(Condition = colnames(input[ ,1:5]),
       "One" = c(nrow(input[input$AF==1 & input$Counts==1,]),
                 nrow(input[input$CAD==1 & input$Counts==1,]),
                 nrow(input[input$CHF==1 & input$Counts==1,]),
                 nrow(input[input$DEM==1 & input$Counts==1,]),
                 nrow(input[input$DIAB==1 & input$Counts==1,])),
       "Two" = c(nrow(input[input$AF==1 & input$Counts==2,]),
                 nrow(input[input$CAD==1 & input$Counts==2,]),
                 nrow(input[input$CHF==1 & input$Counts==2,]),
                 nrow(input[input$DEM==1 & input$Counts==2,]),
                 nrow(input[input$DIAB==1 & input$Counts==2,])),
       "Three" = c(nrow(input[input$AF==1 & input$Counts==3,]),
                 nrow(input[input$CAD==1 & input$Counts==3,]),
                 nrow(input[input$CHF==1 & input$Counts==3,]),
                 nrow(input[input$DEM==1 & input$Counts==3,]),
                 nrow(input[input$DIAB==1 & input$Counts==3,])),
       "Four" = c(nrow(input[input$AF==1 & input$Counts==4,]),
                 nrow(input[input$CAD==1 & input$Counts==4,]),
                 nrow(input[input$CHF==1 & input$Counts==4,]),
                 nrow(input[input$DEM==1 & input$Counts==4,]),
                 nrow(input[input$DIAB==1 & input$Counts==4,])),
       "Five" = c(nrow(input[input$AF==1 & input$Counts==5,]),
                 nrow(input[input$CAD==1 & input$Counts==5,]),
                 nrow(input[input$CHF==1 & input$Counts==5,]),
                 nrow(input[input$DEM==1 & input$Counts==5,]),
                 nrow(input[input$DIAB==1 & input$Counts==5,])),
       "Six" = c(nrow(input[input$AF==1 & input$Counts==6,]),
                 nrow(input[input$CAD==1 & input$Counts==6,]),
                 nrow(input[input$CHF==1 & input$Counts==6,]),
                 nrow(input[input$DEM==1 & input$Counts==6,]),
                 nrow(input[input$DIAB==1 & input$Counts==6,]))
)

It helps to verbalize this: **"split by Counts, then aggregate each column by sum"** — smci, Mar 13 '17 at 12:14
Taking row-sums, then counting `nrow(input[input$Var==1 & input$Counts==whatever,])` is just an indirect way of aggregating by column-sum, split and then combined by Counts. — smci, Mar 13 '17 at 12:15

lmo · Accepted Answer · 2017-03-13T12:17:45.000

Maybe you are looking for aggregate.

Here is one solution.

myMat <- t(aggregate(.~Counts, data=input, FUN=sum)[-1,-1])
myMat
     2  3  4  5 6
AF   3 10 15 15 2
CAD  2 14 16 18 2
CHF  2 14 18 16 2
DEM  4  8 16 18 2
DIAB 5 14 22 17 2

The first argument to aggregate, . ~ Counts is a formula that says perform some operation on every column by Counts. The second argument specifies the data set, and the third argument notes that the desired operation is sum. The first column and row are dropped from the output using [-1, -1] since they are extraneous to the desired result. This output is then transposed with t. To change the column names, you can use colnames<- like

colnames(myMat) <- c("One", "Two", "Three", "Four", "Five")

reproducible data

set.seed(1234)
input <- data.frame(AF = sample(0:1, 100, replace=TRUE),
                    CAD = sample(0:1, 100, replace=TRUE),
                    CHF = sample(0:1, 100, replace=TRUE),
                    DEM = sample(0:1, 100, replace=TRUE),
                    DIAB = sample(0:1, 100, replace=TRUE))
input$Counts <- rowSums(input)

It helps to verbalize this: **"split by Counts, then aggregate each column by column-sum"** — smci, Mar 13 '17 at 12:19
Well that certainly save on typing! Thank you just what I was after. — Andrew Scotchmer, Mar 14 '17 at 12:25

score 0 · Answer 2 · answered Mar 13 '17 at 12:24

You can also use dplyr and tidyr to switch to and from a long and wide format (Although in this particular case, using aggregate is easier):

library(dplyr)
library(tidyr)

# take the input dataset
input %>%
        # transform to long format
        gather(condition, measurement,AF:DIAB) %>%
        # summarise by Counts and condition
        group_by(Counts, condition) %>%
        summarise(measure = sum(measurement)) %>%
        # transform back to the desired wide format
        spread(Counts, measure)

R Data transform - Columns to Rows and aggregate

2 Answers2