I am relatively new to R and I am attempting to write my first multi-step function. Essentially, I want a function that takes a directory, reads the files in it, and finds a given column (in this case, pollutant), then returns the mean of that column with the NAs removed. This is what I have so far:

pollutantmean <- function(directory , pollutant , min_id = 1, max_id = 332) {

setwd(directory)

dirdata <- list.files(path=getwd() , pattern='*.csv' , full.names = TRUE) %>% lapply(read_csv) %>% bind_rows

specdata <- dirdata %>% filter(between(ID,min_id,max_id))

polspecdata <- specdata %>% select(pollutant)

polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(pollutant,na.rm=TRUE))
} 

I feel that I am so close, but the result is a warning instead of a mean: Warning message: In mean.default(pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA. I believe the problem is the column class being col_double, possibly because dirdata is created from multiple csv files. Any help would be greatly appreciated. Thank you!

This is the data: zipfile_data

Len Greski
M Doster
  • Hi M Doster and welcome to SO. Can you give an example of what `pollutant` looks like? – Martin Gal Apr 22 '20 at 22:30
  • Here's an example of a csv file in the directory:

        head(csv1)
        # A tibble: 6 x 4
          Date       sulfate nitrate    ID
        1 2003-01-01      NA      NA     1
        2 2003-01-02      NA      NA     1
        3 2003-01-03      NA      NA     1
        4 2003-01-04      NA      NA     1
        5 2003-01-05      NA      NA     1
        6 2003-01-06      NA      NA     1

    – M Doster Apr 22 '20 at 22:42
  • Try to apply `mean` to a numeric or logical column, i.e. `mean(pollutant$sulfate,na.rm=TRUE)` or `$nitrate`. – Martin Gal Apr 22 '20 at 22:52
  • Thank you for bearing with me as I try to learn the formatting. Never did figure it out for that comment. I tried changing my function to include that and it returned this error: Error in pollutant$sulfate : $ operator is invalid for atomic vectors. Not sure, but this may introduce a new problem since pollutant is a (character class) variable to pull out data from a specific column (either sulfate or nitrate). – M Doster Apr 22 '20 at 23:02
  • Please add your data to your question by editing the question. Furthermore there are better options for formatting your data. Please give an example for `pollutant` and `polspecdata`. – Martin Gal Apr 22 '20 at 23:06
  • The error *"invalid for atomic vectors"* is likely from the `$` operator, meaning you are trying to grab a frame-column from something that is not a `data.frame` (did you say "character"?). Honestly, M Doster, whatever classes use `pollutant` and this structure have cycled through SO so many times (each year) that if you search SO for [`[r] pollutant`](https://stackoverflow.com/search?q=%5Br%5D+pollutant), you will find so many others with workable solutions. Don't copy, always do your own work for class, but known-working examples can be very useful. – r2evans Apr 22 '20 at 23:24
  • @r2evans - with over 4 million people having taken the Johns Hopkins *R Programming* course on Coursera, it's not surprising that there are thousands of questions about `pollutantmean()` on SO. That said, there are probably very few of them that have problems with `dplyr` non-standard evaluation because `dplyr` isn't introduced until the course after *R Programming* in the JHU curriculum. – Len Greski Apr 23 '20 at 03:44

2 Answers

The code in the original post fails because it uses dplyr within a function, but does not use dplyr quoting functions. When we run the code through the RStudio debugger and stop at line 7, we see the following:

[Screenshot of the RStudio debugger, not reproduced here]

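dplyr does not resolve the function argument within mean(pollutant, na.rm = TRUE) to a column, so line 9 fails. Inside summarize(), dplyr first looks for a column literally named pollutant in the polspecdata data frame. Finding none, it falls back to the function argument, whose value is a text string such as "sulfate", and mean() cannot average a character value.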
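The failure can be reproduced without dplyr or any data files. A minimal sketch, assuming the argument holds the string "sulfate" (the warning is suppressed here only so the snippet runs quietly):

```r
# The function argument is a character string, not a column of numbers.
pollutant <- "sulfate"

# mean() of a character value warns "argument is not numeric or logical"
# and returns NA, which is the behaviour reported in the question.
result <- suppressWarnings(mean(pollutant, na.rm = TRUE))

is.na(result)
```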

One way to fix the error is to adjust line 9 so that it explicitly references the data frame passed along by the %>% pipe operator, through dplyr's .data pronoun, and uses the [[ form of the extract operator to look up the column named by the string in the argument.

polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(.data[[pollutant]],na.rm=TRUE))

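Finally, since the function should return the mean to its caller, we end the function with the object created in line 9. An assignment as the last expression returns its value invisibly, so nothing prints when the function is called; the object on a line by itself is returned visibly.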

polspecdatamean
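The two changes above can be tried out on a small made-up data frame. This is only a sketch: `toy` and `column_mean()` are hypothetical names used for illustration, not part of the assignment, and the values are invented.

```r
library(dplyr)

# Toy data standing in for the combined monitor files (values are made up).
toy <- data.frame(
  ID      = c(1, 1, 2, 2),
  sulfate = c(2, NA, 4, 6),
  nitrate = c(1, 3, NA, 5)
)

# Hypothetical helper: `column` is a string naming the column to average.
column_mean <- function(data, column) {
  result <- data %>%
    summarize(mean_value = mean(.data[[column]], na.rm = TRUE))
  result  # last expression, so this is what the function returns visibly
}

column_mean(toy, "sulfate")
column_mean(toy, "nitrate")
```

Because the column name travels as an ordinary string, the same helper works for either pollutant without any quoting functions.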

Since this is a programming assignment for the Johns Hopkins University R Programming course on Coursera, I won't post a complete answer because that violates the Coursera Honor Code.

Simplifying the solution

Once the data has been filtered in line 5, the function could simply return the mean as follows.

mean(specdata[[pollutant]],na.rm=TRUE)
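To see why `[[` works here where `$` does not, here is a small base R sketch on made-up data (`toy` is a hypothetical stand-in for specdata):

```r
# Toy stand-in for the filtered data (values are made up).
toy <- data.frame(ID = c(1, 1, 2, 2), sulfate = c(2, NA, 4, 6))
pollutant <- "sulfate"

# `[[` evaluates its argument, so it looks up the column named "sulfate".
mean(toy[[pollutant]], na.rm = TRUE)

# `$` does not evaluate its right-hand side: it looks for a column literally
# named "pollutant", which does not exist here, so the result is NULL.
is.null(toy$pollutant)
```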

Conclusions

For this particular assignment, dplyr makes the work more difficult than it needs to be, because dplyr uses non-standard evaluation and isn't covered in the JHU curriculum until the third course in the sequence.

The code has some other subtle defects whose correction we'll leave as an exercise for the reader. For example, given the assignment requirements, the function should be able to handle the following inputs:

pollutantmean("specdata","sulfate",23) # calc mean for sensor 23
pollutantmean("specdata","nitrate",70:72) # calc mean for sensors 70 - 72 
pollutantmean("specdata","sulfate",c(3,5,7,9)) # calc mean for sensors 3, 5, 7, and 9 
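As a hint and not a full solution: in each of these calls the third argument is matched positionally to min_id, while max_id keeps its default of 332. The sketch below, on made-up data with hypothetical names (`toy`, `id`), contrasts a min/max range test with matching an explicit vector of sensor IDs.

```r
# Toy data: one row per sensor (values are made up).
toy <- data.frame(
  ID      = c(1, 2, 3, 4, 5),
  sulfate = c(1, 2, NA, 4, 5)
)

# Hypothetical vector of requested sensors, not a contiguous range.
id <- c(2, 3, 5)

# A min/max range test (what between() does) also pulls in sensor 4.
in_range <- toy[toy$ID >= min(id) & toy$ID <= max(id), ]

# %in% keeps exactly the requested sensors.
selected <- toy[toy$ID %in% id, ]

mean(selected[["sulfate"]], na.rm = TRUE)
```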
Len Greski
  • Thank you so much! I was getting frustrated over a few days trying to write this code. As for dplyr, the Coursera course isn't the only R coding course I have taken, and I have been using notes from multiple classes to look for functions to use. I need to review more examples with [[ so that I can use it better in the future. Thanks again! – M Doster Apr 23 '20 at 15:25
  • @MDoster - thanks for the feedback. Information on the `[[` form of the extract operator is in my article [Forms of the Extract Operator](https://github.com/lgreski/datasciencectacontent/blob/master/markdown/rprog-extractOperator.md). Good luck with the rest of the *R Programming* course! – Len Greski Apr 23 '20 at 22:08
  • That was a much better explanation of the operators than the lecture. Thank you for that resource. – M Doster Apr 25 '20 at 19:15

Assuming you are passing the pollutant argument as a string, try the function below.

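library(tidyverse)

pollutantmean <- function(directory, pollutant, min_id = 1, max_id = 332) {

  # Read every csv file in the directory and stack them into one data frame.
  # `pattern` is a regular expression, so match a literal ".csv" at the end.
  dirdata <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE) %>%
    map_df(read_csv)

  # Keep the requested IDs, then average the column named by the string `pollutant`
  dirdata %>%
    filter(between(ID, min_id, max_id)) %>%
    summarise(mean_pollutant = mean(!!sym(pollutant), na.rm = TRUE))
}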

So you can call it as

pollutantmean('/path', 'sulfate', 1, 10)

sym() converts the string "sulfate" into a symbol and !! unquotes it, so summarise() evaluates sulfate as a column and not as a string.
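To check this without the assignment data, here is a small sketch on a made-up data frame (`toy` is hypothetical). It calls sym() through the rlang namespace explicitly, and compares the result with the `.data` pronoun approach from the other answer:

```r
library(dplyr)

# Toy data (values are made up).
toy <- data.frame(ID = c(1, 1, 2, 2), nitrate = c(1, 3, NA, 5))
pollutant <- "nitrate"

# rlang::sym() turns the string into the symbol `nitrate`;
# !! splices that symbol into the expression before dplyr evaluates it.
with_sym <- toy %>%
  summarise(mean_pollutant = mean(!!rlang::sym(pollutant), na.rm = TRUE))

# The .data pronoun reaches the same column by name.
with_pronoun <- toy %>%
  summarise(mean_pollutant = mean(.data[[pollutant]], na.rm = TRUE))

with_sym$mean_pollutant
with_pronoun$mean_pollutant
```

Both forms average the nitrate column; which one to use is mostly a matter of taste.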

Ronak Shah