Conditional summing across columns with dplyr

Question

I have a data frame with four habitats sampled over eight months. Ten samples were collected from each habitat each month. The number of individuals for species in each sample was counted. The following code generates a smaller data frame of a similar structure.

# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

I want to sum the total number of individuals by month, across all species sampled. I'm using ddply (preferred) but I'm open to other suggestions.

The closest I get is to add together the sum of each column, as shown here.

library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))

#   Month tot_by_mon
# 1   Jan         84
# 2   Feb         92
# 3   Mar         67

This works, but I wonder if there is a generic method to handle cases with an "unknown" number of species. That is, the first species always begins in the 4th column but the last species could be in the 10th or 42nd column. I do not want to hard code the actual species names into the summary function. Note that the species names vary widely, such as Doryflav and Pheibica.

A base R solution that returns a single column matrix with months as row names is `rowsum(rowSums(df[3:6]), df$Month)`. — lmo, Oct 31 '17 at 19:37

score 4 · Accepted Answer · answered Nov 01 '17 at 14:15

4

Similar to @useR's answer with data.table's melt, you can use tidyr to reshape with gather:

library(tidyr)
library(dplyr)
gather(df, Species, Value, matches("Species")) %>% 
  group_by(Month) %>% summarise(z = sum(Value))

# A tibble: 3 x 2
   Month     z
  <fctr> <int>
1    Jan    90
2    Feb    81
3    Mar    70

If you know the columns by position instead of a pattern to be "matched"...

gather(df, Species, Value, -(1:3)) %>% 
  group_by(Month) %>% summarise(z = sum(Value))

(Results shown using @akrun's set.seed(123) example data.)

answered Nov 01 '17 at 14:15

Frank

66,179
8
96
180

I tried Psidom's approach of `gather(df, Species, Value, 4:ncol(.)) %>% group_by(Month) %>% summarise(z = sum(Value))` which hits a bug "fixed" by installing an additional package: https://github.com/tidyverse/tidyr/issues/350#issuecomment-326920323 – Frank Nov 01 '17 at 14:17
1

I ran into a "bug" with your solution until I learned of conflicts between `tidyverse` (inc. `tidyr`) and the `plyr` package. Without realizing the conflict, I had loaded `plyr` after `tidyverse`. `Gather` returned a single total value instead of grouped by habitats. Some reading revealed the problem. I made a few changes elsewhere and no longer need `plyr` for my current project. Your solution works as advertised. Thanks! – Michael S Taylor Nov 01 '17 at 15:59

acylam · Answer 2 · 2017-11-01T13:36:03.503

Here's another solution with data.table without needing to know the names of the "Species" columns:

library(data.table)

DT = melt(setDT(df), id.vars = c("Habitat", "Month", "Sample"))    
DT[, .(tot_by_mon=sum(value)), by = "Month"]

or if you want it compact, here's a one-liner:

melt(setDT(df), 1:3)[, .(tot_by_mon=sum(value)), by = "Month"]

Result:

   Month tot_by_mon
1:   Jan         90
2:   Feb         81
3:   Mar         70

Data: (Setting seed to make example reproducible)

set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

Psidom · Answer 3 · 2017-10-31T19:31:31.317

2

Suppose Speciess columns all start with Species, you can select them by the prefix and sum using group_by %>% do:

library(tidyverse)
df %>% 
    group_by(Month) %>% 
    do(tot_by_mon = sum(select(., starts_with('Species')))) %>% 
    unnest()

# A tibble: 3 x 2
#   Month tot_by_mon
#  <fctr>      <int>
#1    Jan         63
#2    Feb         67
#3    Mar         58

If column names don't follow a pattern, you can select by column positions, for instance if Species columns go from 4th to the end of data frame:

df %>% 
    group_by(Month) %>% 
    do(tot_by_mon = sum(select(., 4:ncol(.)))) %>% 
    unnest()

# A tibble: 3 x 2
#   Month tot_by_mon
#  <fctr>      <int>
#1    Jan         63
#2    Feb         67
#3    Mar         58

edited Oct 31 '17 at 19:31

answered Oct 31 '17 at 19:26

Psidom

209,562
33
339
356

1

Or use tidyr to avoid do(): `gather(df, Species, Value, matches("Species")) %>% group_by(Month) %>% summarise(z = sum(Value))` – Frank Oct 31 '17 at 19:28
Unfortunately, as I noted in the final sentence of my question, the names of the species do not fit a convenient pattern. – Michael S Taylor Oct 31 '17 at 19:28
2

@Mike See `?dplyr::select` -- there are many options for selecting the columns you want. Eg, `gather(df, Species, Value, -(1:3)) %>% group_by(Month) %>% summarise(z = sum(Value))` – Frank Oct 31 '17 at 19:29
You can alternatively select by column positions, see the update. Or use negative numbers to exclude columns as @Frank's comment. – Psidom Oct 31 '17 at 19:33
@Frank I like the `gather` method. Would you consider developing your comment into an answer? – Michael S Taylor Nov 01 '17 at 08:44

score 2 · Answer 4 · edited Jun 20 '20 at 09:12

Here is another option with data.table without reshaping to 'long' format

library(data.table)
setDT(df)[, .(tot_by_mon = Reduce(`+`, lapply(.SD, sum))), Month,
          .SDcols = Species1:Species3]
#   Month tot_by_mon
#1:   Jan         90
#2:   Feb         81
#3:   Mar         70

Or with tidyverse, we can also make use of map functions which would be efficient

library(dplyr)
library(purrr)
df %>% 
  group_by(Month) %>%
  nest(starts_with('Species')) %>%
  mutate(tot_by_mon = map_int(data, ~sum(unlist(.x)))) %>% 
  select(-data)
# A tibble: 3 x 2
#    Month tot_by_mon
#   <fctr>      <int>
#1    Jan         90
#2    Feb         81
#3    Mar         70

data

set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2),
                        levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

Conditional summing across columns with dplyr

4 Answers4

data