I am trying to use the summarize function within dplyr to calculate summary statistics using a two-argument function that takes a table and a field name from a connected database. Unfortunately, as soon as I wrap the summarize call in another function, the results are no longer correct: the resulting data frame repeats the same values in every row instead of computing them per group. The input and output are shown below:

Summary Statistics Function

library(dplyr)

data <- iris
data <- group_by(.data = data, Species)

SummaryStatistics <- function(table, field){
  table %>%
    summarise(count = n(),
              min = min(table[[field]], na.rm = T),
              mean = mean(table[[field]], na.rm = T, trim=0.05),
              median = median(table[[field]], na.rm = T))
}

SummaryStatistics(data, "Sepal.Length")

Output Table--Incorrect, it's just repeating the same calculation

     Species count   min     mean median
1     setosa    50   4.3 5.820588    5.8
2 versicolor    50   4.3 5.820588    5.8
3  virginica    50   4.3 5.820588    5.8

Correct Table/Desired Outcome--This is what the table should look like. When I run the summarize function outside of the wrapper function, this is what it produces.

      Species count   min     mean median
 1     setosa    50   4.3 5.002174    5.0
 2 versicolor    50   4.9 5.934783    5.9
 3  virginica    50   4.9 6.593478    6.5

I hope this is easy to understand. I just can't grasp why the summary statistics work perfectly outside of the wrapper function, but as soon as I pass arguments to it, it calculates the same thing for each row. Any help would be greatly appreciated.

Thanks, Kev

AlphaKevy
  • 1
    Hard to diagnose without knowing how you're using the wrapper function. But at a guess, once inside the wrapper function, `summarize` might not know about the grouping factors being used in the calculation. So it would return the same summary for all rows. – jdobres Jan 21 '17 at 19:18
  • @jdobres I'll add the wrapper function. Sorry about that. – AlphaKevy Jan 21 '17 at 19:22
  • 1
    You'll need to use standard evaluation. Read the `dplyr` vignette on it for a better idea. – Jake Kaupp Jan 21 '17 at 22:59

1 Answer

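To use dplyr functions programmatically, you need the standard-evaluation (SE) versions of the verbs, such as `summarise_`, together with lazyeval. In your wrapper, `table[[field]]` extracts the entire column rather than the current group's values, so every group gets the same result. The dplyr NSE vignette covers it fairly well.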

library(dplyr)
library(lazyeval)

data <- group_by(iris, Species)

SummaryStatistics <- function(table, field){
  table %>%
    summarise_(count = ~n(),
              min = interp(~min(var, na.rm = T), var = as.name(field)),
              mean = interp(~mean(var, na.rm = T, trim=0.05), var = as.name(field)),
              median = interp(~median(var, na.rm = T), var = as.name(field)))
}

SummaryStatistics(data, "Sepal.Length")

# A tibble: 3 × 5
     Species count   min     mean median
      <fctr> <int> <dbl>    <dbl>  <dbl>
1     setosa    50   4.3 5.002174    5.0
2 versicolor    50   4.9 5.934783    5.9
3  virginica    50   4.9 6.593478    6.5
pietrodito
Jake Kaupp
  • 2
    Thank you for answering my question, and more importantly thank you for linking to the document on how to use dplyr programmatically. I was searching for something like this but couldn't track it down. I really appreciate the thoroughness of your answer. Thanks again man. – AlphaKevy Jan 22 '17 at 01:17
  • 3
    The NSE vignette link is dead, it looks like it's been replaced with the [Programming with dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). – ropeladder Nov 02 '17 at 17:17
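
For readers on a newer dplyr, here is a minimal sketch of the same wrapper written in the style the Programming with dplyr vignette describes, using the `.data` pronoun instead of `summarise_` and lazyeval. It assumes dplyr 0.7.0 or later and an in-memory data frame (the built-in `iris`); whether every call, for example `mean(..., trim = )`, translates on a database backend is not checked here. The function and argument names are simply carried over from the question.

```r
library(dplyr)

# Grouped input, as in the question
data <- group_by(iris, Species)

SummaryStatistics <- function(table, field) {
  table %>%
    summarise(
      count  = n(),
      # .data[[field]] looks up the column named by the string `field`
      # within the current group, unlike table[[field]], which returns
      # the whole column regardless of grouping
      min    = min(.data[[field]], na.rm = TRUE),
      mean   = mean(.data[[field]], na.rm = TRUE, trim = 0.05),
      median = median(.data[[field]], na.rm = TRUE)
    )
}

result <- SummaryStatistics(data, "Sepal.Length")
print(result)
```

Because `field` stays a plain string, the call site is unchanged from the question: `SummaryStatistics(data, "Sepal.Length")`. The result should have one row per species, matching the desired table in the question.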