0

I have a collection of data frames, df_i, representing the ith visit of a set of patients to a hospital. I'd like to summarize each of the data frames to determine the number of men, women and total patients at the ith visit. While I can solve this, my solution is clumsy. Is there a simpler way to get the final dataframe that I want? Example follows:

df_1 <- data.frame(
  ID     = c(rep("A",4), rep("B",3), rep("C",2), "D"),
  Dates  = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-01-10"), by = "day"),
  Sex    = c(rep("Male",4), rep("Male",3), rep("Female",2), "Female"),
  Weight = seq(100, 190, 10),
  Visit  = rep(1, 10)
)

df_2 <- data.frame(
  ID     = c(rep("A",4), rep("B",3), rep("C",2)),
  Dates  = seq.Date(from = as.Date("2020-02-01"), to = as.Date("2020-02-9"), by = "day"),
  Sex    = c(rep("Male",4), rep("Male",3), rep("Female",2)),
  Weight = seq(100, 180, 10),
  Visit  = rep(2, 5)
)

df_3 <- data.frame(
  ID     = c(rep("A",4), rep("B",3)),
  Dates  = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-03-07"), by = "day"),
  Sex    = rep("Male",7),
  Weight = seq(140, 200, 10),
  Visit  = rep(3, 7)
)

I'm looking to generate the following result:

> df_sum
  Visit Patients Men Women
1     1        4   2     2
2     2        3   2     1
3     3        2   2     0

I can do this in a very clumsy way: First create a temporary data frame that summarizes the information in df_1

df_tmp <- df_1 %>%
            group_by(ID) %>%
            filter(Dates == min(Dates)) %>%
            summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))
> df_tmp
# A tibble: 4 x 4
  ID        n   Men Women
  <chr> <int> <int> <int>
1 A         1     1     0
2 B         1     1     0
3 C         1     0     1
4 D         1     0     1

Next, sum each of the columns in df_tmp to create the first row for the summary column.

r1 <- c(sum(df_tmp$n), sum(df_tmp$Men), sum(df_tmp$Women))

Repeat for the second and third data frames. Finally rbind the rows together to create the summary data frame. While this works, it is extremely clumsy, and doesn't generalize to the case when I have a variable number of visits. Would someone kindly point me to a mmore elegant solution to my problem?

Many thanks in advance

Thomas Philips

Thomas Philips
  • 935
  • 2
  • 11
  • 22

2 Answers2

2

Could also make into a tibble with bind_rows:

library(tidyverse)
bind_rows(df_1, df_2, df_3, .id = "day") %>%
  group_by(day, ID) %>%
  slice_min(Dates) %>%
  group_by(day) %>%
  summarize(n = n(), Men = sum(Sex == "Male"), Women = sum(Sex == "Female"))

Result

# A tibble: 3 x 4
  day       n   Men Women
* <chr> <int> <int> <int>
1 1         4     2     2
2 2         3     2     1
3 3         2     2     0
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • This is simple and clean. I suppose `slice_min` could also be implemented as `filter(Dates == min(Dates))`, I also just chose not to add an id column, but to just use `Visit` to group by. Thank you. – Thomas Philips Feb 28 '21 at 03:12
1

Put the data in a list and iterate over them through map so that you don't have to repeat the code for each dataframe. Using janitor::adorn_totals you can add a new row in the output with the total and get the data in wide format.

library(tidyverse)

list_df <- list(df_1, df_2, df_3)

map_df(list_df, ~.x %>% 
              group_by(ID) %>%
              filter(Dates == min(Dates)) %>%
              ungroup %>%
              count(Sex) %>%
              janitor::adorn_totals(name = 'Patients'), .id = 'Visit') %>%
  pivot_wider(names_from = Sex, values_from = n, values_fill = 0)

#  Visit Female  Male Patients
#  <chr>  <int> <int>    <int>
#1 1          2     2        4
#2 2          1     2        3
#3 3          0     2        2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213