
I replaced Ubuntu 20.04 with Fedora 37 on my laptop (clean install, 16 GB RAM) to follow my lab's standard and, curiously, readr no longer works with a 6.7 GB CSV file in this specific case: it crashes RStudio. What can explain this? readr handled the same file fine on Ubuntu.

library(archive)

url <- "https://www.usitc.gov/data/gravity/itpd_e/itpd_e_r02.zip"
zip <- gsub(".*/", "", url)

# download the zip only if it is not already present locally
if (!file.exists(zip)) {
  try(download.file(url, zip, method = "wget", quiet = TRUE))
}

# extract the CSV into the working directory only if it is not already there
if (!length(list.files(getwd(), pattern = "ITPD_E_R02\\.csv")) == 1) {
  archive_extract(zip, dir = getwd())
}

# this will crash RStudio
# trade <- readr::read_csv("ITPD_E_R02.csv")

# this won't
trade <- data.table::fread("ITPD_E_R02.csv")
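
As a quick sanity check (my addition, not part of the original code), and assuming the CSV really was extracted into the working directory as above, the on-disk size can be confirmed before attempting the read:

# confirm the extracted file exists and report its size in GiB
file.exists("ITPD_E_R02.csv")
file.info("ITPD_E_R02.csv")$size / 1024^3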

free memory

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           15699        4332        1106        1032       10259        9957
Swap:           8191           9        8182
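
A small, Linux-specific sketch (my addition) for tracking the R process's own resident set size (RSS) from within the session, which is the figure some of the comments below refer to:

# read the R process's RSS from /proc/self/status (Linux only) and report it in MB
vmrss <- grep("^VmRSS:", readLines("/proc/self/status"), value = TRUE)
as.numeric(strsplit(vmrss, "\\s+")[[1]][2]) / 1024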

session info

 sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 37 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.3

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.14.8 readr_2.1.4       archive_1.1.5    

loaded via a namespace (and not attached):
 [1] fansi_1.0.4       tzdb_0.3.0        utf8_1.2.3        R6_2.5.1          lifecycle_1.0.3  
 [6] magrittr_2.0.3    pillar_1.8.1      rlang_1.0.6       cli_3.6.0         rstudioapi_0.14  
[11] ellipsis_0.3.2    vctrs_0.5.2       tools_4.2.2       glue_1.6.2        hms_1.1.2        
[16] compiler_4.2.2    pkgconfig_2.0.3   CoprManager_0.5.0 tibble_3.1.8
  • would the `vroom` package help ... ? – Ben Bolker Mar 07 '23 at 03:08
  • 2
    Have you tried `readr::read_csv("/ITPD_E_R02.csv", lazy = TRUE)`? – Ritchie Sacramento Mar 07 '23 at 03:15
  • 2
    Side note: `!length(list.files(getwd(), pattern = "ITPD_E_R02\\.csv")) == 1` is much simpler with `file.exists("ITPD_E_R02.csv")`. It doesn't need to look for and regex against filenames, and the use of `getwd()` is assumed with a relative path. – r2evans Mar 07 '23 at 13:17
  • 2
    On a very simple benchmark, I tried `read_csv` and `fread`, the latter was faster but consumed 12% _more_ memory, my guess is because `data.table` tends to over-allocate memory (columns, etc) in order to support its fast operations. Both increased the process's RSS ("Resident Set Size", a common measure of non-swapped physical memory used by a process) by over 7GB, which _should_ be fine on your 16GB system, but as you might imagine, the OS and all other processes running will impact that. To really dive into this, one could also look into other system processes as well as available swap. – r2evans Mar 07 '23 at 13:27
  • 2
    It _might_ also be due to a bug in `readr`, though honestly I'd be surprised: there seems nothing cosmic about this file that would (from my perspective) trigger a bug, it's just _big_. It's big enough, frankly, that I think you would benefit from using alternative access methods such as `arrow::open_dataset("~/Downloads/ITPD_E_R02.csv", format="csv")`, which supports _*lazy*_ operations within the `dplyr` dialect of R, see https://arrow.apache.org/docs/r/articles/dataset.html. – r2evans Mar 07 '23 at 13:32
  • 2
    Side note: I have 64GB of ram, so I gave it a few tries (new process each time). I was able to use `utils::read.csv` (231sec), `readr::read_csv` (62sec), `data.table::fread` (25sec), and `arrow::open_dataset(.., format="csv")` (0.02sec). All except the last one took 20-40 seconds and consumed 7GB+ of RSS. I was not able to use `arrow::read_csv_arrow`, it crashed with `src/arrow/result.cc:28: ValueOrDie called on an error: Out of memory: malloc of size 262144 failed`, which given the amount of free ram I have, is most likely a bug. – r2evans Mar 07 '23 at 13:37
  • 2
    FYI, [arrow#34487](https://github.com/apache/arrow/issues/34487) for the `read_csv_arrow` crash. – r2evans Mar 07 '23 at 13:51
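
Pulling together the suggestions from the comments (readr's lazy reading, which is backed by vroom, and an arrow lazy dataset), here is a sketch of the two proposed workarounds. It is untested on this machine, and the column name used with dplyr is a placeholder, since the actual ITPD_E_R02.csv schema is not shown in the question:

# Option 1: let readr/vroom index the file lazily instead of materialising it all at once
trade <- readr::read_csv("ITPD_E_R02.csv", lazy = TRUE)

# Option 2: open the CSV as an arrow dataset and query it lazily with dplyr;
# rows are only read into memory at collect()
library(dplyr)
trade_ds <- arrow::open_dataset("ITPD_E_R02.csv", format = "csv")
trade_ds %>%
  filter(year == 2016) %>%  # "year" is a hypothetical column name used only for illustration
  collect()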
