0

I'm struggling with a data transformation in R. The data I receive is of this type:

input <- data.frame(AF = sample(0:1, 100, replace=TRUE),
                CAD = sample(0:1, 100, replace=TRUE),
                CHF = sample(0:1, 100, replace=TRUE),
                DEM = sample(0:1, 100, replace=TRUE),
                DIAB = sample(0:1, 100, replace=TRUE))
input$Counts <- rowSums(input)

And the output I'm trying to achieve is:

output <- data.frame(Condition = c('AF', 'CAD', 'CHF', 'DEM', 'DIAB'),
                 '1' = sample(11:20, 5, replace=TRUE),
                 '2' = sample(11:20, 5, replace=TRUE),
                 '3' = sample(11:20, 5, replace=TRUE),
                 '4' = sample(11:20, 5, replace=TRUE),
                 '5' = sample(11:20, 5, replace=TRUE))

Where the intersections are the counts of observations matching the condition (which are now in the first column) and the row sums (which are now separate columns).

My solution is below but I'm wondering if there is a more elegant solution?

data.frame(Condition = colnames(input[ ,1:5]),
       "One" = c(nrow(input[input$AF==1 & input$Counts==1,]),
                 nrow(input[input$CAD==1 & input$Counts==1,]),
                 nrow(input[input$CHF==1 & input$Counts==1,]),
                 nrow(input[input$DEM==1 & input$Counts==1,]),
                 nrow(input[input$DIAB==1 & input$Counts==1,])),
       "Two" = c(nrow(input[input$AF==1 & input$Counts==2,]),
                 nrow(input[input$CAD==1 & input$Counts==2,]),
                 nrow(input[input$CHF==1 & input$Counts==2,]),
                 nrow(input[input$DEM==1 & input$Counts==2,]),
                 nrow(input[input$DIAB==1 & input$Counts==2,])),
       "Three" = c(nrow(input[input$AF==1 & input$Counts==3,]),
                 nrow(input[input$CAD==1 & input$Counts==3,]),
                 nrow(input[input$CHF==1 & input$Counts==3,]),
                 nrow(input[input$DEM==1 & input$Counts==3,]),
                 nrow(input[input$DIAB==1 & input$Counts==3,])),
       "Four" = c(nrow(input[input$AF==1 & input$Counts==4,]),
                 nrow(input[input$CAD==1 & input$Counts==4,]),
                 nrow(input[input$CHF==1 & input$Counts==4,]),
                 nrow(input[input$DEM==1 & input$Counts==4,]),
                 nrow(input[input$DIAB==1 & input$Counts==4,])),
       "Five" = c(nrow(input[input$AF==1 & input$Counts==5,]),
                 nrow(input[input$CAD==1 & input$Counts==5,]),
                 nrow(input[input$CHF==1 & input$Counts==5,]),
                 nrow(input[input$DEM==1 & input$Counts==5,]),
                 nrow(input[input$DIAB==1 & input$Counts==5,])),
       "Six" = c(nrow(input[input$AF==1 & input$Counts==6,]),
                 nrow(input[input$CAD==1 & input$Counts==6,]),
                 nrow(input[input$CHF==1 & input$Counts==6,]),
                 nrow(input[input$DEM==1 & input$Counts==6,]),
                 nrow(input[input$DIAB==1 & input$Counts==6,]))
)
smci
  • 32,567
  • 20
  • 113
  • 146
Andrew Scotchmer
  • 313
  • 3
  • 12
  • It helps to verbalize this: **"split by Counts, then aggregate each column by sum"** – smci Mar 13 '17 at 12:14
  • Taking row-sums, then counting `nrow(input[input$Var==1 & input$Counts==whatever,])` is just an indirect way of aggregating by column-sum, split and then combined by Counts. – smci Mar 13 '17 at 12:15

2 Answers2

1

Maybe you are looking for aggregate.

Here is one solution.

myMat <- t(aggregate(.~Counts, data=input, FUN=sum)[-1,-1])
myMat
     2  3  4  5 6
AF   3 10 15 15 2
CAD  2 14 16 18 2
CHF  2 14 18 16 2
DEM  4  8 16 18 2
DIAB 5 14 22 17 2

The first argument to aggregate, . ~ Counts is a formula that says perform some operation on every column by Counts. The second argument specifies the data set, and the third argument notes that the desired operation is sum. The first column and row are dropped from the output using [-1, -1] since they are extraneous to the desired result. This output is then transposed with t. To change the column names, you can use colnames<- like

colnames(myMat) <- c("One", "Two", "Three", "Four", "Five")

reproducible data

set.seed(1234)
input <- data.frame(AF = sample(0:1, 100, replace=TRUE),
                    CAD = sample(0:1, 100, replace=TRUE),
                    CHF = sample(0:1, 100, replace=TRUE),
                    DEM = sample(0:1, 100, replace=TRUE),
                    DIAB = sample(0:1, 100, replace=TRUE))
input$Counts <- rowSums(input)
lmo
  • 37,904
  • 9
  • 56
  • 69
0

You can also use dplyr and tidyr to switch to and from a long and wide format (Although in this particular case, using aggregate is easier):

library(dplyr)
library(tidyr)

# take the input dataset
input %>%
        # transform to long format
        gather(condition, measurement,AF:DIAB) %>%
        # summarise by Counts and condition
        group_by(Counts, condition) %>%
        summarise(measure = sum(measurement)) %>%
        # transform back to the desired wide format
        spread(Counts, measure)
ira
  • 2,542
  • 2
  • 22
  • 36