
I am new to R. For an assignment I wrote code that reads several csv files, binds them into one data frame, and then, for a given set of IDs, calculates the mean of either nitrate or sulfate.

Data sample:

  Date       sulfate nitrate    ID
  <date>       <dbl>   <dbl> <dbl>
1 2003-10-06    7.21   0.651     1
2 2003-10-12    5.99   0.428     1
3 2003-10-18    4.68   1.04      1
4 2003-10-24    3.47   0.363     1
5 2003-10-30    2.42   0.507     1
6 2003-11-11    1.43   0.474     1
... 

To read the files and create a data.frame, I wrote this function:

library(readr)
library(dplyr)

pollutantmean <- function (pollutant, id = 1:332) {
        # create a data frame from several files
  file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
  read_file_m <- lapply(file_m, read_csv)
  df_1 <- bind_rows(read_file_m)

        # delete NAs
  df_clean <- df_1[complete.cases(df_1), ]

        # select rows according to id
  df_asid_clean <- filter(df_clean, ID %in% id)

        # calculate the mean of the chosen column
  mean_result <- mean(df_asid_clean[, pollutant])
  mean_result
}

However, when the read_csv function is applied, the nitrate column is parsed as col_logical, although the column is numeric and the entries are numeric. The parser seems to "expect" a logical value even though the real values are not logical. While reading, I get this message:

<...>
Parsed with column specification:
cols(
  Date = col_date(format = ""),
  sulfate = col_double(),
  nitrate = col_logical(),
  ID = col_double()
)
Warning: 41 parsing failures.
 row     col           expected actual               file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383  'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355  'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469  'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144  'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details. 

I tried to change the column class after binding rows by writing df_1[, "nitrate"] <- as.numeric(as.character(df_1[, "nitrate"])), but NAs are then introduced again at the step that calculates the mean.
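
For reference, the same attempt written with [[ , which extracts the column from a tibble as a plain vector (unlike [ , which returns a one-column tibble):

df_1[["nitrate"]] <- as.numeric(as.character(df_1[["nitrate"]]))
# this cannot recover the data: read_csv() has already parsed the failing
# entries as logical NA, so the coercion only yields more NAs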

What is wrong here, and how could I solve it? Would appreciate your help!

UPDATE: I tried to insert read_csv(col_types = list(...)) inside lapply(), but I get an error that the "file" argument is not defined. As I understand it, R evaluates the read_csv(...) call first, before lapply() runs, and because no file is given at that point it throws the error.
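
A minimal sketch of the difference, with the intended column types (nitrate as a double rather than the mis-guessed logical):

library(readr)

# fails: read_csv(...) is evaluated immediately, before lapply() can supply a file
# read_file_m <- lapply(file_m, read_csv(col_types = list(...)))

# works: extra arguments are passed through lapply()'s ... to each read_csv(file) call
read_file_m <- lapply(file_m, read_csv,
                      col_types = list(col_date(), col_double(),
                                       col_double(), col_double()))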

MaPe
  • My experience with importing CSV or any other text data automatically is that it often fails to guess the types. You will have to provide the correct column types to read_csv. Trying to correct this after the import is useless; the damage has already been done. –  Aug 14 '20 at 08:53
  • Try specifying the column types manually with the `col_types` argument to `read_csv`. There are examples in the help for `read_csv`; if you are using RStudio, you can interactively generate the code by reading in one file via the Files panel and copying the generated code. – snaut Aug 14 '20 at 08:56
  • I believe the `read_csv()` documentation says it uses the first 1,000 rows to guess type. If the first 1,000 are missing, it will default to logical. You can tell it to use more rows to guess type, manually specify the type on import, use `type.convert()` (with `as.is = TRUE`) or `readr::type_convert()` later, or use a different function to read in the data. The method used to guess type in `data.table::fread()` works differently from `read_csv()`, and I prefer it in 90% of cases. If you use `fread()` you may want to specify `data.table = FALSE`, and you will also have to coerce date columns. – Andrew Aug 14 '20 at 10:28
  • In my opinion, I would try to fix the problem on import rather than use `type.convert()` or `type_convert()`. If you like `read_csv()` then increase the `guess_max` argument (which may slow it down a little); a rough sketch of that fix appears below these comments. Otherwise I would try `fread()`, but you'll need to fix dates, and you may want to coerce the result with `as_tibble()` if you really want a tibble rather than a data.table or data.frame. For me, R functions usually do a great job guessing type; it helps to know exactly what the functions are doing, though. The documentation is worth the effort for both `readr::read_*()` and `data.table::fread()`. – Andrew Aug 14 '20 at 10:38
  • `read_csv()` includes an argument, `col_types=` that allows one to specify the data types for each column being read, as I illustrate in my answer. – Len Greski Aug 17 '20 at 01:30
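
A minimal sketch of the `guess_max` fix suggested in the comments (the file name is taken from the warning output above; `1e5` is an arbitrary value chosen to exceed the row count of the largest file):

library(readr)

# inspect up to 1e5 rows (instead of the default 1,000) before guessing each
# column's type, so a long run of leading NAs in nitrate no longer makes
# read_csv() guess col_logical()
one_file <- read_csv("specdata/288.csv", guess_max = 1e5)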

2 Answers


The problem with readr::read_csv() failing to parse the column types can be overcome by passing a col_types= argument through lapply(). We do this as follows:

pollutantmean <- function (directory,pollutant,id=1:332){
     require(readr)
     require(dplyr)
     file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id] 
     read_file_m <- lapply(file_m, read_csv,
                           col_types = list(col_date(), col_double(),
                                            col_double(), col_integer()))
     # rest of code goes here. Since I am a Community Mentor in the
     # JHU Data Science Specialization, I am not allowed to post
     # a complete solution to the programming assignment 
}

Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't needed. This also eliminates the need for the filter() statement in the code posted in the question.

With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.

> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72) 
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833

Alternatively, we could implement lapply() with an anonymous function that also uses read_csv(), as follows:

 read_file_m <- lapply(file_m, function(x) {
      read_csv(x, col_types = list(col_date(), col_double(),
                                   col_double(), col_integer()))
 })

NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr's non-standard evaluation gives people fits. An example of this situation is yet another Stack Overflow question on pollutantmean().

Len Greski
  • Thank you, that makes sense. It explains why I get "input nor logical or numeric" when I use read_csv even with all the suggestions to change the column type, while read.csv works just fine! – MaPe Aug 17 '20 at 08:08
  • @MaPe - you're welcome. If you found my answer helpful, please accept and upvote it. – Len Greski Aug 17 '20 at 09:58

With the read_csv update you don't need lapply(); you can pass the file paths you have already defined directly to read_csv().

Regarding the column types, these can then be set manually in the col_types argument:

col_types = cols(Date = col_date(format = ""), sulfate = col_double(), ...)
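
For example, a sketch assuming readr 2.0 or later, where read_csv() accepts a vector of paths and row-binds the files itself:

library(readr)

file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)

# one call reads and row-binds every file, with explicit column types so
# nitrate can no longer be guessed as logical
df_1 <- read_csv(file_m,
                 col_types = cols(Date = col_date(format = ""),
                                  sulfate = col_double(),
                                  nitrate = col_double(),
                                  ID = col_double()))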

alejandro_hagan