0

Say I have a dataset like this:

df <- data.frame(id = c(1, 1, 1, 2, 2),
      classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
      hours = c(3, 2, 4, 1, 2))  

I.e.,

   id  classname  hours
1   1   Welding     3
2   1   Welding     2
3   1   Auto        4
4   2   HVAC        1
5   2   Plumbing    2

I'm trying to figure out how to summarize the data in a way that gives me, for each id, a list of the classes they took as well as how many hours of each class. I would want these to be in a list so I can keep it one row per id. So, I would want it to return:

   id     class.list     class.hours
1   1    Welding, Auto       5,4   
2   2    HVAC, Plumbing      1,2    

I was able to figure out how to get it to return the class.list.

library(dplyr)
classes <- df %>%
group_by(id) %>%
summarise(class.list = list(unique(as.character(classname)))) 

This gives me:

   id     class.list     
1   1    Welding, Auto         
2   2    HVAC, Plumbing      

But I'm not sure how I could get it to sum the number of hours for each of those classes (class.hours).

Thanks for your help!

Jacob Curtis
  • 788
  • 1
  • 8
  • 22

2 Answers2

1

In base R, this can be accomplished with two calls to aggregate. The inner call sums the hours and the outer call "concatenates" the hours and the class names. In the outer call of aggregate, cbind is used to include both the hours and the class names in the output, and also to provide the desired variable names.

# convert class name to character variable
df$classname <- as.character(df$classname)
# aggregate
aggregate(cbind("class.hours"=hours, "class.list"=classname)~id,
          data=aggregate(hours~id+classname, data=df, FUN=sum), toString)
  id class.hours     class.list
1  1        4, 5  Auto, Welding
2  2        1, 2 HVAC, Plumbing

In data.table, roughly the same output is produced with a chained statement.

setDT(df)[, .(hours=sum(hours)), by=.(id, classname)][, lapply(.SD, toString), by=id]
   id      classname hours
1:  1  Welding, Auto  5, 4
2:  2 HVAC, Plumbing  1, 2

The variable names could then be set using the data.table setnames function.

lmo
  • 37,904
  • 9
  • 56
  • 69
1

This is how you could do it using dplyr:

classes <- df %>%
  group_by(id, classname) %>%
  summarise(hours = sum(hours)) %>%
  summarise(class.list = list(unique(as.character(classname))),
            class.hours = list(hours)) 

The first summarise peels of the latest group by (classname). It is not necessary to use unique() anymore, but I kept it in there to match the part you already had.

onnhoJ
  • 56
  • 7