
I'm seeing unexpected behavior from the excellent readr::read_csv(). When a CSV file has a character column in which every string begins with "Inf" (e.g. "Inform", "Information"), read_csv() guesses the column is numeric and reads each value as Inf instead of as a string. The base read.csv() reads the same column correctly as a string. If the column contains at least one string that does not begin with "Inf" (e.g. "Indigo"), then read_csv() reads it correctly as a string. read_csv() also reads it correctly if the col_types argument specifies the column as character, but that requires manual checks/edits.

Do others have this issue, and if so is there an argument for read_csv() or other workaround that will allow read_csv() to reliably read character vectors that happen to contain only strings beginning with "Inf"? It seems problematic to have to continually check all character vectors first and then manually specify col_types if all the strings happen to begin with "Inf".

Thanks very much, and apologies if I'm just missing something.

suppressPackageStartupMessages(library(tidyverse))

#################################################

# save tibble with character vector containing only strings that begin with "Inf"
test_1 <- tibble(x = c("Inform", "Information"))
test_1 %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"
test_1 %>% write_csv(file = "test_1.csv")

# read_csv() seems to convert the strings into numeric Inf because they all begin with "Inf"
# however, if col_types is manually specified as col_character, then read_csv() correctly reads the vector as a string
read_csv(file = "test_1.csv")
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", lazy = FALSE)
#> Rows: 2 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> dbl (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1   Inf
#> 2   Inf
read_csv(file = "test_1.csv", col_types = cols(x = col_character()))
#> # A tibble: 2 x 1
#>   x          
#>   <chr>      
#> 1 Inform     
#> 2 Information

# read.csv() correctly reads the vector as a string
read.csv(file = "test_1.csv") %>% glimpse()
#> Rows: 2
#> Columns: 1
#> $ x <chr> "Inform", "Information"

# read_csv() correctly reads similar character vectors if they contain at least one string that does not begin with "Inf"
test_2 <- tibble(x = c("Inform", "Indigo", "Information")) %>% write_csv(file = "test_2.csv")
read_csv(file = "test_2.csv") %>% glimpse()
#> Rows: 3 Columns: 1
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (1): x
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 3
#> Columns: 1
#> $ x <chr> "Inform", "Indigo", "Information"

#################################################

# get version info
packageVersion("tidyverse")
#> [1] '1.3.1'
version
#>                _                           
#> platform       x86_64-w64-mingw32          
#> arch           x86_64                      
#> os             mingw32                     
#> system         x86_64, mingw32             
#> status                                     
#> major          4                           
#> minor          1.1                         
#> year           2021                        
#> month          08                          
#> day            10                          
#> svn rev        80725                       
#> language       R                           
#> version.string R version 4.1.1 (2021-08-10)
#> nickname       Kick Things

Created on 2021-10-22 by the reprex package (v2.0.1)

sdevine188

1 Answer


Look, {readr} has heuristics to guess the column types, but you're supposed to know what you're trying to load. Here's how I usually do it.

df <- readr::read_csv("somefile.csv")

spec(df) 

This should show you the type it has guessed for each column. Copy that and paste it into the original call, then adjust whatever needs changing:

df <- readr::read_csv("somefile.csv", col_types = cols(
   a = col_character(),
   b = col_double()  # col_numeric() is not available in current readr
   # etc.
   ))
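If you'd rather not list every column, one possible workaround is to turn guessing off altogether: read everything as text, then convert only the columns you know are numeric. This is a minimal sketch, not the only way to do it. The helper name `read_csv_as_text` and the sample columns `x` and `n` are made up for illustration; `cols(.default = col_character())` and `parse_double()` are regular readr functions.

```r
library(readr)

# Hypothetical helper (not part of readr): read every column as character,
# so no value can be guessed to be a numeric Inf.
read_csv_as_text <- function(file, ...) {
  read_csv(file, col_types = cols(.default = col_character()), ...)
}

# Illustrative file: one "Inf"-prefixed text column and one numeric column.
path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c("Inform", "Information"), n = c(1, 2)), path)

raw <- read_csv_as_text(path)

# Convert only the columns you know to be numeric.
raw$n <- parse_double(raw$n)
```

Because col_types is supplied for every column, no guessing takes place, and `x` stays a character vector regardless of what its values begin with.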
Mossa
  • Agreed that manually inspecting the data to be loaded is best practice. It's a small issue I guess, but it'd be nice if maybe the read_csv() heuristic could look at the length of strings, or even just the fourth character, before deciding to read the string vector as a numeric Inf. Or at least have an optional argument to do so. – sdevine188 Oct 22 '21 at 16:58
  • It does have multiple `col_*` functions for all sorts of parsing, plus it asks you to inspect the `spec()` in detail as soon as you load something, precisely because of potential "inconsistencies" with what it has assumed. If you do what I wrote, you'll get less noise from the `read_*` call. – Mossa Oct 22 '21 at 17:00
  • True, I guess it just seems like if base read.csv() handles it correctly, then read_csv would as well... – sdevine188 Oct 22 '21 at 17:07
  • Update: this bug should be fixed in the next release of vroom https://github.com/tidyverse/readr/issues/1319 – sdevine188 Jan 25 '22 at 22:38
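
For anyone still on an affected version, a guard along these lines could flag suspicious columns after a read without checking each one by hand. It is only a sketch in base R: the helper name `all_inf_columns` is hypothetical, and it assumes that a double column whose non-missing values are all infinite is worth a second look, which may not hold for every dataset.

```r
# Hypothetical helper: names of double columns whose non-missing values
# are all infinite, i.e. candidates for text misread as Inf.
all_inf_columns <- function(df) {
  is_suspect <- vapply(
    df,
    function(col) {
      present <- col[!is.na(col)]
      is.double(col) && length(present) > 0 && all(is.infinite(present))
    },
    logical(1)
  )
  names(df)[is_suspect]
}

# Illustrative data: only x consists entirely of Inf.
example <- data.frame(x = c(Inf, Inf), y = c(1, Inf), z = c("a", "b"))
suspect <- all_inf_columns(example)
```

Any column name returned could then be re-read with an explicit `col_character()` entry in col_types.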