
I just noticed that read_csv() somehow uses random numbers which is unexpected (at least to me). The corresponding base R function read.csv() does not do that. So, what does read_csv() use the random numbers for? I looked into the documentation but could not find a clear answer to that. Are the random numbers related to the guess_max argument?

library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
#[1] 1.239496

EDIT:

  1. As suggested by rawr's comment, I tried specifying col_types and indeed it worked. But still I wonder why this is happening. Anyone got an explanation?
set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
#[1] -0.5604756
  2. Since a lot of people asked about the R and readr version, here is my session info.
library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2021-06-10                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version     date       lib source                            
#>  cli           2.5.0       2021-04-26 [1] CRAN (R 4.0.3)                    
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                    
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.3)                    
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.3)                    
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.0.5)                    
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.0.5)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.3)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.3)                    
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.0.5)                    
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.9003  2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#>  knitr         1.33        2021-04-24 [1] CRAN (R 4.0.5)                    
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.0.4)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.3)                    
#>  pillar        1.6.1       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.3)                    
#>  ps            1.6.0       2021-02-28 [1] CRAN (R 4.0.5)                    
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.3)                    
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.5)                    
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)      
#>  rmarkdown     2.8.1       2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.0.3)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.3)                    
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.3)                    
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.0.3)                    
#>  tibble        3.1.2       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 4.0.3)                    
#>  vctrs         0.3.8       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.22        2021-03-11 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.3)                    
#> 
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library

Created on 2021-06-10 by the reprex package (v2.0.0)

AlbertRapp
  • The function `read_csv()` is not "using" the random seed. The function reads the file provided by the filename. However, you are right, it breaks the applicability of the seed. Just read in the data and then set the seed. Your `rnorm()` will then work as expected. – Ray Jun 09 '21 at 18:12
  • What version of R? With R-4.0.3, I get `-0.56` after both `read.csv` and `readr::read_csv`. (Using `readr-1.3.1`.) – r2evans Jun 09 '21 at 18:42
  • Would definitely like to know `packageVersion("readr")` ... – Ben Bolker Jun 09 '21 at 18:58
  • I get `-0.5604756` as expected when I specify `col_types` – rawr Jun 09 '21 at 19:59
  • @rawr, do you get the alternate value when *not* using `col_types`? Package version? @r2evans, did you read a copy of the `titanic` data set or some other data set? (I poked through the code a little bit looking for calls to `sample()` or other RNG-using stuff; nothing jumped out at me.) – Ben Bolker Jun 10 '21 at 01:00
  • @Ray, I know that I could easily "fix" this issue that way but still I want to understand why read_csv() breaks the seed. There should be nothing random about read_csv from my understanding, so the seed should not be affected. – AlbertRapp Jun 10 '21 at 15:54
  • @BenBolker I can confirm this also happens when reading a copy of another dataset (iris in my test). Also, it is solved by specifying `col_types` – MalditoBarbudo Jun 10 '21 at 18:06

1 Answer

tl;dr somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code is generating a random string to use as a label.


A major clue is that

set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)

runs read_csv guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.

By making a copy of the random seed info (R <- .Random.seed) and stepping through the code (debug(readr::show_cols_spec)) and periodically running identical(R, .Random.seed) to check on the status, I found that the random info changes after running

cli::cli_h1("Column specification")

Debugging into that function, the change occurs somewhere in cli::cli__message; specifically, right before we execute this line

 if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()

(which is here in the source code of cli), identical(R, .Random.seed) is still TRUE; after running it, it's FALSE. More specifically, all we have to do at this point is evaluate the args argument (e.g. by typing args in the debugger).
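The check used throughout this walk-through (snapshot `.Random.seed`, run one step, compare with `identical()`) can be wrapped in a small helper. This is only a sketch in base R; `touches_rng()` is a hypothetical name, not a function from readr or cli, and note that it resets the seed itself before evaluating the expression.

```r
# Hypothetical helper (not part of readr or cli): returns TRUE if evaluating
# `expr` changes the global RNG state, FALSE otherwise.
touches_rng <- function(expr, seed = 123) {
  set.seed(seed)                                      # guarantees .Random.seed exists
  before <- get(".Random.seed", envir = globalenv())  # snapshot of the RNG state
  force(expr)                                         # `expr` is lazy; it is evaluated here
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

touches_rng(toupper("no randomness here"))  # FALSE
touches_rng(sample(10))                     # TRUE
```

With the package versions discussed in this answer, passing the `read_csv()` call (without `col_types`) as `expr` should presumably reproduce the observation from the question.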

Working our way back up the chain and trying things by hand, we can see that manually evaluating

glue_cmd(text, .envir = .envir)

at this point in the code changes the random info.

Still more stepping through takes us to a point within glue_cmd where make_cmd_transformer is called, which in turn calls a function named random_id():

values$marker <- random_id()

random_id() then calls sample ...
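To illustrate why a single `sample()` call inside a helper is enough to shift later draws, here is a minimal sketch. `make_marker()` is a hypothetical stand-in for a random-label generator, not cli's actual `random_id()` code, and `make_marker_preserving()` is one assumed way such a helper could leave the caller's RNG state untouched.

```r
# Hypothetical stand-in for a random-label helper; NOT cli's actual code.
make_marker <- function() {
  paste(sample(letters, 8, replace = TRUE), collapse = "")
}

# Variant that saves the global RNG state and restores it on exit,
# so callers see no change to their random number stream.
make_marker_preserving <- function() {
  old <- get0(".Random.seed", envir = globalenv(), inherits = FALSE)
  on.exit({
    if (is.null(old)) {
      # No seed existed before: remove the one created by sample().
      if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
        rm(list = ".Random.seed", envir = globalenv())
      }
    } else {
      assign(".Random.seed", old, envir = globalenv())
    }
  }, add = TRUE)
  make_marker()
}

set.seed(123); x0 <- rnorm(1)
set.seed(123); invisible(make_marker());            x1 <- rnorm(1)
set.seed(123); invisible(make_marker_preserving()); x2 <- rnorm(1)

identical(x0, x1)  # FALSE: the marker consumed random numbers
identical(x0, x2)  # TRUE: the RNG state was restored
```

The same save-and-restore idea can be applied from the outside, by wrapping the call that prints the column specification, if changing the helper itself is not an option.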

I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?


This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).

Ben Bolker
  • Thanks for the detailed answer. I asked a question on the [GitHub](https://github.com/r-lib/cli/issues/294) about the purpose of `random_id()`. Maybe someone there can comment on this. – AlbertRapp Jun 11 '21 at 07:21
  • Quoting the maintainer of `cli`: "Random strings are often used in text processing, as markers, when you need to make sure that your marker does not appear in the text itself." Nevertheless, it was classified as a bug. – AlbertRapp Jun 11 '21 at 09:02