Merging specific rows by summing certain columns on grouping variables

Question

The following dataframe is a subset of a bigger df, which contains duplicated information

df<-data.frame(Caught=c(92,134,92,134),
               Discarded=c(49,47,49,47),
               Units=c(170,170,220,220),
               Hours=c(72,72,72,72),
               Colour=c("red","red","red","red"))

In Base R, I would like to get the following:

df_result<-data.frame(Caught=226,
                      Retained=96,
                      Units=390,
                      Hours=72,
                      colour="red")

So basically the results is the sum of unique values for columns Caught, Retained, Units and leaving the same value for Hours and colour (Caught=92+134, Retained=49+47, Units= 170+220, Hours=72, colour="red)

However, I intend to do this in a much bigger data.frame with several columns. My idea was to apply a function based on column names as:

l <- lapply(df, function(x) {
  if(names(x) %in% c("Caught","Discarded","Units"))
    sum(unique(x))
  else
    unique(x)
})
as.data.frame(l)

However, this does not work, as I am not entirely sure how to extract vector names when using lapply() and other functions such as this.

I have tried withouth succes to implement by(), apply() functions.

Thanks

score 1 · Answer 1 · answered Feb 23 '21 at 13:22

1

Asking for Base R:

    l <- lapply( df, function(n) {
        if( is.numeric(n) )
            sum( unique(n) )
        else
            unique( n )
    })
    as.data.frame(l)

This solution takes advantage of the fact that data.frames are really just lists of vectors.

It produces this:

    #  Caught Discarded Units Hours Colour
    #    226        96   390    72    red

answered Feb 23 '21 at 13:22

Sirius

5,224
2
14
21

Thank you very much. I was hoping for a less automated way of specifying which columns to sum. Instead of using as.numeric, how would I specify the columns names that I want to sum? – gfmg1992 Feb 23 '21 at 16:35
See the answer by baroulotte for how to explicitly handle each name in the data.frame – Sirius Feb 23 '21 at 23:25
Thanks. I have updated the questions, as the answer from @barboulotte is not what I am looking for. – gfmg1992 Feb 24 '21 at 12:31

score 0 · Answer 2 · answered Feb 23 '21 at 13:25

A proposition:

df <-data.frame(Caught=c(92,134,92,134),
                 Discarded=c(49,47,49,47),
                 Units=c(170,170,220,220),
                 Hours=c(72,72,72,72),
                 Colour=c("red","red","red","red"))

df
#>   Caught Discarded Units Hours Colour
#> 1     92        49   170    72    red
#> 2    134        47   170    72    red
#> 3     92        49   220    72    red
#> 4    134        47   220    72    red


df_results <- data.frame(Caught = sum(unique(df$Caught)),
                         Discarded = sum(unique(df$Discarded)),
                         Units = sum(unique(df$Units)),
                         Hours = unique(df$Hours),
                         Colour = unique(df$Colour))

df_results
#>   Caught Discarded Units Hours Colour
#> 1    226        96   390    72    red

# Created on 2021-02-23 by the reprex package (v0.3.0.9001)

Regards,

Merging specific rows by summing certain columns on grouping variables

2 Answers2