I'm new to R, and I'm trying to write a function that adds the entries of two data frame columns row by row and returns the data frame with

  1. a new column containing the row sums
  2. that column given a name.

Here's a sample df of my data:

Ethnicity <- c('A', 'B', 'H', 'N', 'O', 'W', 'Unknown')
Texas <- c(2,41,56,1,3,89,7)
Tenn <- c(1,9,2,NA,1,32,3)
df <- data.frame(Ethnicity, Texas, Tenn)

When I directly try the following code, the columns are summed by row as desired:

new_df <- df %>% rowwise() %>%
                 mutate(TN_TX = sum(Tenn, Texas, na.rm = TRUE))  

new_df

But when I try to use my function code, rowwise() seems not to work. My function code is:

df.sum.col <- function(df.in, col.1, col.2)  {

if(is.data.frame(df.in) != TRUE){               #warning if first arg not df
  warning('df.in is not a dataframe')}

if(is.numeric(col.1) != TRUE){                
  warning('col.1 is not a numeric vector')}     

if(is.numeric(col.2) != TRUE){
  warning('col.2 is not a numeric vector')}     #warning if col not numeric 


df.out <- rowwise(df.in) %>%
                 mutate(name = sum(col.1, col.2, na.rm = TRUE))

df.out 
}


bad_df <- df.sum.col(df, Texas, Tenn)

This results in

bad_df

.

I don't understand why the core of the function works outside it but not within. I also tried piping df.in to rowwise() like this:

df.out <- df.in %>% rowwise() %>%
                 mutate(name = sum(col.1, col.2, na.rm = TRUE))

But that doesn't resolve the problem.

As for naming the new column, I tried adding the name as an argument, but didn't have any success. Thoughts on this?

Any help appreciated!

  • I don't think you can pass arguments to a `mutate` function like that. The answer you are getting is `sum(Texas,Tenn,na.rm=TRUE)`, adding up the vectors in your global environment. If you do `rm(c("Texas","Tenn"))` first, you will see that your function doesn't work at all. I think you need to take a look at the non-standard evaluation for dplyr vignette `vignette("nse")` – thelatemail Mar 14 '17 at 22:43
  • passing rowwise() to mutate() works outside the function in the first example above. I ran rm(Tenn, Texas) and I'm still getting the same result. Are you suggesting I use mutate_() instead? I appreciate the link, but it's a little over my head still. – AndyDufresne Mar 14 '17 at 23:21
  • Error in rm(c("Texas", "Tenn")) : ... must contain names or character strings is the error I get when running rm(c("Texas","Tenn")). I tried mutate_(), but my output is the same. I appreciate your help, but I'm afraid I'm at a loss. Why would the same line of code behave differently inside a function? Is it something about the variables that are stand-ins for the arguments? – AndyDufresne Mar 15 '17 at 00:51
  • Apologies, I got that last part wrong `rm(list=c("Texas","Tenn"))` would do it too. But yes, a character string and an unquoted expression representing an object are two different things. – thelatemail Mar 15 '17 at 01:06
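The point made in these comments can be sketched in base R without dplyr. This is a minimal illustration, assuming the sample vectors from the question; `sum_args` is a hypothetical stand-in for the `mutate()` call inside `df.sum.col`, not the asker's actual function. It shows that `col.1` and `col.2` are ordinary arguments, so R resolves `Texas` and `Tenn` where the function was called from rather than as columns of `df.in`.

```r
Texas <- c(2, 41, 56, 1, 3, 89, 7)
Tenn  <- c(1, 9, 2, NA, 1, 32, 3)
df <- data.frame(Texas, Tenn)

# Hypothetical stand-in for the body of df.sum.col
sum_args <- function(df.in, col.1, col.2) {
  # col.1 and col.2 are never looked up in df.in here; they are
  # whatever the caller's Texas and Tenn objects evaluate to
  sum(col.1, col.2, na.rm = TRUE)
}

# One grand total over both whole vectors, not one sum per row
with_globals <- sum_args(df, Texas, Tenn)
with_globals

# Remove the free-standing vectors; the columns inside df remain
rm(Texas, Tenn)
without_globals <- try(sum_args(df, Texas, Tenn), silent = TRUE)
inherits(without_globals, "try-error")  # TRUE: 'Texas' is no longer found
```

Once the free-standing vectors are gone, the call fails even though `df` still has both columns, which is why the function needs some form of non-standard evaluation to reach them.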

1 Answer

As suggested by @thelatemail, it's down to non-standard evaluation; rowwise() has nothing to do with it. You need to rewrite your function to use mutate_. It can be tricky to understand, but here's one version of what you're trying to do:

library(dplyr)
df <- tibble::tribble(
  ~Ethnicity, ~Texas, ~Tenn,
  "A", 2, 1,
  "B", 41, 9,
  "H", 56, 2,
  "N", 1, NA,
  "O", 3, 1,
  "W", 89, 32,
  "Unknown", 7, 3
)

df.sum.col <- function(df.in, col.1, col.2, name)  {

  if(is.data.frame(df.in) != TRUE){               #warning if first arg not df
    warning('df.in is not a dataframe')}

  if(is.numeric(lazyeval::lazy_eval(substitute(col.1), df.in)) != TRUE){                
    warning('col.1 is not a numeric vector')}     

  if(is.numeric(lazyeval::lazy_eval(substitute(col.2), df.in)) != TRUE){
    warning('col.2 is not a numeric vector')}     #warning if col not numeric 

  dots <- setNames(list(lazyeval::interp(~sum(x, y, na.rm = TRUE),
                                         x = substitute(col.1), y = substitute(col.2))),
                   name)

  df.out <- rowwise(df.in) %>%
    mutate_(.dots = dots)

  df.out 
}
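For readers on a newer dplyr, where the `mutate_` family is deprecated, the same idea can be sketched with tidy evaluation. This is only a sketch, assuming dplyr 0.8 or later with rlang 0.4 or later; the function name `df_sum_col_tidy` is hypothetical and the input checks are left out for brevity.

```r
library(dplyr)

df <- data.frame(
  Ethnicity = c("A", "B", "H", "N", "O", "W", "Unknown"),
  Texas = c(2, 41, 56, 1, 3, 89, 7),
  Tenn = c(1, 9, 2, NA, 1, 32, 3)
)

# Hypothetical name, chosen so it does not clash with df.sum.col
df_sum_col_tidy <- function(df.in, col.1, col.2, name) {
  df.in %>%
    rowwise() %>%
    # {{ }} forwards the unquoted column names into mutate();
    # !!name := uses the string held in `name` as the new column name
    mutate(!!name := sum({{ col.1 }}, {{ col.2 }}, na.rm = TRUE)) %>%
    ungroup()
}

new_df <- df_sum_col_tidy(df, Texas, Tenn, "TN_TX")
new_df$TN_TX
```

Here `Texas` and `Tenn` exist only as columns of `df`, so the call succeeds only because the column names are forwarded into `mutate()` rather than evaluated in the calling environment.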

In practice, you shouldn't need to use rowwise at all here, but can use rowSums, after selecting only the columns you need to sum.
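A minimal sketch of that rowSums approach in base R follows. The helper name `df_row_sums` and its interface (column names passed as strings) are assumptions made for illustration, not part of the function above.

```r
df <- data.frame(
  Ethnicity = c("A", "B", "H", "N", "O", "W", "Unknown"),
  Texas = c(2, 41, 56, 1, 3, 89, 7),
  Tenn = c(1, 9, 2, NA, 1, 32, 3)
)

# Hypothetical helper: `cols` is a character vector of column names,
# `name` is the name of the new column of row sums
df_row_sums <- function(df.in, cols, name) {
  stopifnot(is.data.frame(df.in), all(cols %in% names(df.in)))
  is_num <- vapply(df.in[cols], is.numeric, logical(1))
  if (!all(is_num)) {
    warning("non-numeric column(s): ", paste(cols[!is_num], collapse = ", "))
  }
  # Select only the columns to sum, then add them row by row;
  # na.rm = TRUE treats missing values as zero, as in the question
  df.in[[name]] <- unname(rowSums(df.in[cols], na.rm = TRUE))
  df.in
}

out <- df_row_sums(df, c("Texas", "Tenn"), "TN_TX")
out$TN_TX
```

Because the column names are plain strings, this version sidesteps non-standard evaluation entirely, at the cost of having to quote the names in the call.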

Nick Kennedy
  • Will this code only work with tibbles? I tried running it with my data and sample df, and it returns 'object not found' in reference to the dataframes. – AndyDufresne Mar 15 '17 at 14:17
  • @AndyDufresne There was an error in the function where I'd put in `df` rather than `df.in` in the function, now corrected. However, it should work with a normal `data.frame` just as well as a `tibble`. Could you provide at least `dput(head(df))` of the data you're using? – Nick Kennedy Mar 15 '17 at 20:27
  • Can you give me some context for `dput(head(df))`? – AndyDufresne Mar 15 '17 at 22:33
  • @AndyDufresne I'm asking you to give me the top few rows of your dataframe in a way that is easily entered into R so that I can reproduce your error. It works fine on my sample data using either `df.sum.col(df, Texas, Tenn)` or `df.sum.col(as.data.frame(df), Texas, Tenn)` – Nick Kennedy Mar 15 '17 at 22:35
  • It works! Incredible! The data I'm using is really wide. I took the dataframe from [WaPo's GitHub](https://github.com/washingtonpost/data-police-shootings/blob/master/fatal-police-shootings-data.csv) and selected only the ethnicity(race) and state, then spread by state: – AndyDufresne Mar 15 '17 at 22:47
  • `fatal_shootings <- read_csv('fatal_shootings.csv', col_names = TRUE) state_shootings <- count(fatal_shootings, race, state) shootings <- spread(state_shootings, key = state, value = n)` – AndyDufresne Mar 15 '17 at 22:47
  • I'm trying to understand what's going on with the code you wrote, not least because I want to be able to add an argument that will be the name of the new column. – AndyDufresne Mar 15 '17 at 22:51
  • @AndyDufresne Modified to allow setting of the name. I still find it takes me a bit of time to get this type of non-standard evaluation code right. In essence, you're substituting unevaluated expressions into a formula, which then gets passed to the `_` version of `mutate`. – Nick Kennedy Mar 15 '17 at 22:54
  • Nick, can you recommend additional reading on NSE? The vignette thelatemail posted was above my pay grade, so to speak. I'm (obviously) a novice programmer. – AndyDufresne Mar 15 '17 at 23:01
  • @AndyDufresne Have a look at [Hadley's book](http://adv-r.had.co.nz/Computing-on-the-language.html). There are also some alternative ways of using `mutate_`. For example, you could replace the `dots` line with `dots <- setNames(list(sprintf("sum(%s, %s, na.rm = TRUE)", col.1, col.2)), name)`, but you'd then have to call `df.sum.col` using the names of the columns in quotes (e.g. `df.sum.col(df, "Texas", "Tenn", "TT")`). However, a formula is the preferred option. – Nick Kennedy Mar 15 '17 at 23:09