I'm trying to read a small (17 KB), simple CSV file from EdX.org (for an online course), and I've never had this trouble with readr::read_csv() before. Base R's read.csv() reads the same file without the problem.

library(tidyverse)
df <- read_csv("https://courses.edx.org/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+1T2020+type@asset+block/WHO.csv")
head(df)

This gives the following output:

#> # A tibble: 6 x 13
#>   Country Region Population Under15 Over60 FertilityRate LifeExpectancy
#>   <chr>   <chr>       <dbl>   <dbl>  <dbl> <chr>                  <dbl>
#> 1 Afghan… Easte…      29825    47.4   3.82 "\r5.4\r"                 60
#> 2 Albania Europe       3162    21.3  14.9  "\r1.75\r"                74
#> 3 Algeria Africa      38482    27.4   7.17 "\r2.83\r"                73
#> 4 Andorra Europe         78    15.2  22.9  <NA>                      82
#> 5 Angola  Africa      20821    47.6   3.84 "\r6.1\r"                 51
#> 6 Antigu… Ameri…         89    26.0  12.4  "\r2.12\r"                75
#> # … with 6 more variables: ChildMortality <dbl>, CellularSubscribers <dbl>,
#> #   LiteracyRate <chr>, GNI <chr>, PrimarySchoolEnrollmentMale <chr>,
#> #   PrimarySchoolEnrollmentFemale <chr>

You'll notice that the column FertilityRate has "\r" added to the values. I've downloaded the csv file and cannot find them there.

Base-R read.csv() reads in the file with no problems, so I'm wondering what the problem is with my usage of the tidyverse read_csv().

head(df$FertilityRate)
#> [1] "\r5.4\r"  "\r1.75\r" "\r2.83\r" NA         "\r6.1\r"  "\r2.12\r"

How can I fix my usage of read_csv() so that the "\r" characters are not there?

If possible, I'd prefer not to have to individually specify the type of every single column.

Jeremy K.
  • `read_csv` isn’t doing anything wrong: the `\r` characters are actually inside the file. I’m not sure why `read.csv` would strip them out but it’s actually doing something wrong if it does that. – Konrad Rudolph Mar 26 '20 at 10:30
  • If you look at your csv file in text format, you will notice that entries for the `FertilityRate` column are in a new line. I assume `read_csv` adds this character to indicate line breaks. – broti Mar 26 '20 at 10:31
  • @KonradRudolph, as far as I can see, the `\r` characters are *not* in the file. They appear where there were line breaks. – broti Mar 26 '20 at 10:32
  • @broti I can assure you that they *are* in the file. I’m looking at them as we speak. – Konrad Rudolph Mar 26 '20 at 10:32
  • @KonradRudolph is there an easy way to get rid of the `\r` characters? `read.csv` seems to do it automatically. – Jeremy K. Mar 26 '20 at 10:33
  • @JeremyK. - you can use `readr::parse_number()` to convert to numeric. – Ritchie Sacramento Mar 26 '20 at 10:34
  • @JeremyK. Sure, you can `gsub` them (or use `stringr::str_remove_all`). – Konrad Rudolph Mar 26 '20 at 10:35
  • @KonradRudolph weird - I am looking at them with Sublime Text and they are not there. – broti Mar 26 '20 at 10:35
  • @broti No idea how Sublime Text would display `\r` but other text editors (e.g. Vim) show them, as does a hex editor of course. – Konrad Rudolph Mar 26 '20 at 10:40
  • @KonradRudolph, thanks for clarifying - wasn't aware of that difference between editors. – broti Mar 26 '20 at 10:42
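
The two suggestions from the comments can be tried on a small stand-in vector. This is only a sketch: the vector x is made up to mimic the FertilityRate values printed in the question, and the names via_parse_number and via_gsub are purely illustrative.

```r
library(readr)

# Hypothetical stand-in for df$FertilityRate as printed in the question
x <- c("\r5.4\r", "\r1.75\r", NA, "\r6.1\r")

# Suggestion 1: parse_number() skips non-numeric characters around the number
via_parse_number <- parse_number(x)

# Suggestion 2: remove the carriage returns, then convert explicitly
via_gsub <- as.numeric(gsub("\r", "", x, fixed = TRUE))
```

Both approaches work one column at a time, so they suit cases where only a few columns are affected.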

1 Answer

In a nutshell, the characters are inside the file (probably by accident), and read_csv is right not to remove them automatically. Because they occur within quotes, CSV convention says the parser should treat the field as-is and not strip out whitespace characters. read.csv is wrong to strip them, and this is arguably a bug.
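
One way to check this without relying on how a text editor renders the file is to count the carriage-return bytes (0x0d) directly. The sketch below writes a tiny made-up file to a temporary path as a stand-in. For the real data, path would instead point at the downloaded WHO.csv.

```r
# Hypothetical stand-in file: one quoted field wrapped in carriage returns
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("Country,FertilityRate\nAfghanistan,\"\r5.4\r\"\n"), path)

# Read the file as raw bytes so nothing is translated or hidden
bytes <- readBin(path, what = "raw", n = file.size(path))

# Count carriage returns (byte 0x0d) and line feeds (byte 0x0a)
cr_count <- sum(bytes == as.raw(0x0d))
lf_count <- sum(bytes == as.raw(0x0a))
```

A non-zero cr_count means the characters really are in the file, whatever the editor shows.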

You can strip them out yourself once you’ve loaded the data:

df = mutate_if(df, is.character, ~ stringr::str_remove_all(.x, '\r'))

This seems to be good enough for this file, but in general I’d be wary that the file might be damaged in other ways: the presence of these characters is clearly not intentional, and the file follows no common line ending convention (it’s neither a conventional Windows nor a Unix file).
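
Stripping the characters still leaves the affected columns as character vectors. Since the question asks to avoid specifying every column type by hand, one option is to let readr re-guess the types afterwards with type_convert(). This is a minimal sketch on a made-up two-column tibble, not the real file, and it uses across() in place of mutate_if(), which should behave the same way here.

```r
library(dplyr)
library(readr)
library(stringr)

# Hypothetical stand-in for the data frame returned by read_csv()
df <- tibble(
  Country = c("Afghanistan", "Albania", "Andorra"),
  FertilityRate = c("\r5.4\r", "\r1.75\r", NA)
)

cleaned <- df %>%
  # Remove carriage returns from every character column
  mutate(across(where(is.character), ~ str_remove_all(.x, "\r"))) %>%
  # Re-guess column types now that the stray characters are gone
  type_convert()
```

type_convert() only touches character columns, so columns that were already parsed as numbers are left alone.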

Konrad Rudolph