Combining data under different factor levels while retaining original levels

Question

I would like to have a tidyverse solution for the following problem. In my dataset, I have data on various factor levels. I would like to create a new factor level "Total" that is the sum of all values Y at existing factor levels of X. This can be done, for example, with:

mutate(Data, X = fct_collapse(X, Total = c("A", "B", "C", "D"))) %>%
  group_by(X) %>% 
  summarize(Y = sum(Y))

However, this also necessarily overwrites the original factor levels. I would have to combine the original dataset with the new collapsed dataset in an additional step.

One solution I have used in the past to retain the original levels is to bring data in the wide format and proceed with rowwise() and mutate() to create a new variable with the "Total" and then reshape back to long.

spread(Data, key = X, value = Y) %>%
  rowwise() %>%
  mutate(Total = sum(A, B, C, D)) %>%
  gather(1:5, key = "X", value = "Y")

However, I am very unhappy with this solution since using rowwise() is not considered good practice. It would be great if you could point me to an available alternative solution how to combine data under different factor levels while retaining original levels.

Minimal reproducible example:

Data<-data.frame(
X = factor(c("A", "B", "C", "D")),
Y = c(1000, 2000, 3000, 4000))

Expected result:

# A tibble: 5 x 2
  X         Y
  <chr> <dbl>
1 A      1000
2 B      2000
3 C      3000
4 D      4000
5 Total 10000

`df %>% janitor::adorn_totals("row")` does this work for you? It requires loading an additional package and `Total` will not be added as a factor. — M--, Feb 12 '19 at 16:50
In principle, the function does exactly what I've been looking for (with the slight qualification that it would be great to have "Total" as a factor level). However, I would indeed prefer a tidyverse solution to the problem. But I guess would then have to define my own function e.g. based on @Rui Barradas suggestions. — miwin, Feb 12 '19 at 18:01
I do something similar to this often enough that I added a function `bind_self` to a package I wrote for work: https://github.com/camille-s/camiller/blob/master/R/bind_self.R It's a little beyond the scope of an SO answer — camille, Feb 12 '19 at 18:48
@miwin if you write your own function, is it really a `tidyverse` solution? I am asking conceptually, not arguing about what you may prefer, which is, obviously, totally up to you. — M--, Feb 12 '19 at 19:19
@M-M Your point is well taken :) What I meant was to have an answer that works without additional packages. But your solution is certainly the most concise one and very helpful. — miwin, Feb 12 '19 at 20:47

M-- · Answer 1 · 2021-06-29T02:29:57.957

5

Using janitor library, this would be straightforward.

Data %>% janitor::adorn_totals("row") %>% mutate(X=factor(X))

  # X     Y
  # A     1000
  # B     2000
  # C     3000
  # D     4000
  # Total 10000

Looking at the output structure:

str(output)

# 'data.frame': 5 obs. of  2 variables:
#  $ X: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ Y: num  1000 2000 3000 4000 10000

edited Jun 29 '21 at 02:29

answered Feb 12 '19 at 19:32

M--

25,431
8
61
93

1

I didn't know about package `janitor`, it's always good to know simple solutions. – Rui Barradas Feb 12 '19 at 19:52

Rui Barradas · Answer 2 · 2021-06-28T22:39:02.127

4

Using the suggestion in @M--'s first version of his comment to the question, now edited, I have added bind_rows.
I have also changed the input dataset a bit. Following the OP's and @camille's comment, this dataset has a factor level "Z" but keeps the original order and adds level "Total" at the end.

Data <- data.frame(
  X = factor(c("A", "B", "C", "Z")),
  Y = c(1000, 2000, 3000, 4000))

Data %>%
  mutate(lvl = levels(X),
         X = fct_collapse(X, Total = c("A", "B", "C", "Z")),
         X = as.character(X)) %>%
  bind_rows(mutate(Data, X = as.character(X)), .) %>%
  mutate(X = factor(X, levels = c(lvl, "Total"))) %>%
  group_by(X) %>% 
  summarize(Y = sum(Y)) -> d

d
## A tibble: 5 x 2
#  X         Y
#  <fct> <dbl>
#1 A      1000
#2 B      2000
#3 C      3000
#4 Z      4000
#5 Total 10000

Check the output factor levels.

levels(d$X)
#[1] "A"     "B"     "C"     "Z"     "Total"

edited Jun 28 '21 at 22:39

answered Feb 12 '19 at 16:54

Rui Barradas

70,273
8
34
66

1

I'd add a step with `mutate(X = as_factor(X))` before grouping and summarizing, in order to keep Total in an order that makes sense. To see what I mean, change the level `D` to `Z`; `summarize` uses alphabetical order if the grouping variable isn't a factor – camille Feb 12 '19 at 17:04
Thanks a lot! The extra step with `bind_rows()` fits nicely in the flow. My only concern is using a workflow that produces warnings by default. Should this be a concern or can it be ignored? – miwin Feb 12 '19 at 17:53
1

@camille You are right, I have changed the code to have level `"Total"` after the original levels. – Rui Barradas Feb 12 '19 at 18:33
2

Besides the answer itself, giving credits to others (yeah, we may have figured out whatever they said by ourselves but they said it first anyway) is a very rare thing in our nerdy community. If I could, I would have given +2 instead of +1. One for the answer, one for the ethics. Cheers. – M-- Feb 12 '19 at 19:25

score 1 · Answer 3 · answered Jun 28 '21 at 22:49

This solution can also be used in this case:

library(dplyr)

Data %>%
  add_row(X = "Total", Y = sum(.$Y)) %>%
  mutate(X = factor(X))

      X     Y
1     A  1000
2     B  2000
3     C  3000
4     D  4000
5 Total 10000

Data %>%
  add_row(X = "Total", Y = sum(.$Y)) %>%
  mutate(X = factor(X)) %>%
  {levels(.$X)}

[1] "A"     "B"     "C"     "D"     "Total"

Combining data under different factor levels while retaining original levels

3 Answers3