1

I have a data frame that has some records that are duplicated and I need to aggregate the duplicates so there is a unique record per row.

An example:

Col1    Col2    Col3    Col4
A       0.170   83     0.878
B       0.939   103    0.869
C       0.228   80     0.935
D       0.566   169    0.851
D       0.566   137    0.588
E       0.703   103    0.636

I need to weight the average of Col4 with Col3, and sum Col3. So my result would be:

Col1    Col2    Col3    Col4
A      0.17     83     0.878
B      0.939    103    0.869
C      0.228    80     0.935
D      0.566    306    0.733
E      0.703    103    0.636

Usually I would use the aggregate function but I can't seem to find a solution to include two different function types. Is there another way I can accomplish this? I am effectively ignoring Col 2 since the granularity before merging with the data that brought in Col3 and Col4 was one record per row, and now it is being duplicated.

Thank you!!

pav
  • 59
  • 1
  • 12

2 Answers2

3

Using dplyr, you can use group_by to keep all unique rows of "Col1" and then pass all your different function into summarise. With your example, it can be something like that.

NB: To calculate weighted.mean of Col4 by Col3, you need to pass this function before calculating the sum of Col3, otherwise length of Col4 and Col3 will differ.

You can then reorganize your dataframe in the correct order using select:

library(dplyr)
df %>% group_by(Col1) %>%
  summarise(Col2 = mean(Col2),
            Col4 = weighted.mean(Col4,Col3),
            Col3 = sum(Col3)) %>%
  select(Col1,Col2,Col3,Col4)

# A tibble: 5 x 4
  Col1   Col2  Col3  Col4
  <chr> <dbl> <int> <dbl>
1 A     0.17     83 0.878
2 B     0.939   103 0.869
3 C     0.228    80 0.935
4 D     0.566   306 0.733
5 E     0.703   103 0.636

Data

structure(list(Col1 = c("A", "B", "C", "D", "D", "E"), Col2 = c(0.17, 
0.939, 0.228, 0.566, 0.566, 0.703), Col3 = c(83L, 103L, 80L, 
169L, 137L, 103L), Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588, 
0.636)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x561706072cc0>)
dc37
  • 15,840
  • 4
  • 15
  • 32
  • Thank you! I was not able to get the code to actually apply. It said that it ran but the df appeared exactly the same as the original. I tried the base solution and this one worked! I'm still not sure what went wrong with the dplyr... – pav Feb 03 '20 at 02:08
  • 1
    No problem ;) the code will calculate the new column values but won't change the `df`. If you want to pass these changes as permanent, you need to do `df <- df %>% ...`. But if it works with base r, that's good too. – dc37 Feb 03 '20 at 02:16
  • Thank you! Let me try that... I must have forgotten that step :/ With the other base code, it seemed to create a data frame (as per the class ()) but they are lists, and I am not able to export it to txt. :( I will try this again with saving it! – pav Feb 03 '20 at 02:40
  • 2
    You're welcome. I think you should at least upvote @hello friend's answer because it provide an useful answer ;) – dc37 Feb 03 '20 at 02:46
  • Thank you! Sorry I'm still getting used to how stack overflow works for posting. Just did! – pav Feb 04 '20 at 03:03
2

Base R solution:

aggregated_df <- data.frame(do.call("rbind", lapply(split(df, df$Col1), function(x){
        list(Col1 = unique(x$Col1), Col2 = mean(x$Col2), Col3 = sum(x$Col3), 
                   Col4 = weighted.mean(x$Col4, x$Col3))
      }
    )
  ),
stringsAsFactors = FALSE)

Data:

df <-
  structure(
    list(
      Col1 = c("A", "B", "C", "D", "D", "E"),
      Col2 = c(0.17,
               0.939, 0.228, 0.566, 0.566, 0.703),
      Col3 = c(83L, 103L, 80L,
               169L, 137L, 103L),
      Col4 = c(0.878, 0.869, 0.935, 0.851, 0.588,
               0.636)
    ),
    row.names = c(NA,-6L),
    class = c("data.frame"
    ))
hello_friend
  • 5,682
  • 1
  • 11
  • 15
  • Thank you! I was not able to export the result of this however, due to the result being lists. Even though the class was a data frame. I tried to troubleshoot but may need to spend more time. – pav Feb 03 '20 at 02:47
  • @pav The result should be a single data frame due to the rbind function being applied over the list. Please try View(aggregated_df) and see if you can see the desired result. – hello_friend Feb 05 '20 at 06:19
  • Yes thanks! I can see the results but even though it's a data frame, when I merge I am having an issue due to the list. – pav Feb 06 '20 at 17:03