I can't tell why it crashes instead of throwing an error(*), but without mode = "wb", download.file() on Windows modifies line endings during text transfer, and that can confuse parsers. This is especially true for text files with \r\n (carriage return + line feed) line endings, as those get converted to \r\r\n, so the number of rows doubles, with every second row empty. You can inspect downloaded files in an editor that displays non-printing characters well (e.g. Notepad++) and/or call print() / stringr::str_view() on a string read from the file to debug such cases:
library(stringr)
library(readr)
url_txt_nok <- "https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt"
# download url_ to a temporary .txt file, pass ... on to download.file(), return the file path
dl_txt <- function(url_, ...){
  tmp_txt <- tempfile(fileext = ".txt")
  download.file(url_, destfile = tmp_txt, ...)
  tmp_txt
}
# download.file without mode = "wb", existing \r\n line endings are converted to \r\r\n
txt_nok_mode_def <- dl_txt(url_txt_nok)
# escape all non-ASCII chars to display all non-printing characters, including \n
read_file(txt_nok_mode_def) |> str_trunc(200) |> str_view(use_escapes = TRUE)
#> [1] │ O*NET-SOC Code\tJob Zone\tDate\tDomain Source\r\r\n11-1011.00\t5\t07/2014\tAnalyst\r\r\n11-1011.03\t5\t08/2021\tAnalyst\r\r\n11-1021.00\t4\t07/2015\tAnalyst\r\r\n11-1031.00\t4\t06/2008\tAnalyst\r\r\n11-2011.00\t4\t08/2018\tAnalyst...
read_file(txt_nok_mode_def) |> str_trunc(200) |> str_view()
#> [1] │ O*NET-SOC Code{\t}Job Zone{\t}Date{\t}Domain Source{\r\r}
#> │ 11-1011.00{\t}5{\t}07/2014{\t}Analyst{\r\r}
#> │ 11-1011.03{\t}5{\t}08/2021{\t}Analyst{\r\r}
#> │ 11-1021.00{\t}4{\t}07/2015{\t}Analyst{\r\r}
#> │ 11-1031.00{\t}4{\t}06/2008{\t}Analyst{\r\r}
#> │ 11-2011.00{\t}4{\t}08/2018{\t}Analyst...
read_lines(txt_nok_mode_def)[1:5]
#> [1] "O*NET-SOC Code\tJob Zone\tDate\tDomain Source"
#> [2] ""
#> [3] "\n11-1011.00\t5\t07/2014\tAnalyst"
#> [4] ""
#> [5] "\n11-1011.03\t5\t08/2021\tAnalyst"
# download.file with mode = "wb"
txt_nok_mode_wb <- dl_txt(url_txt_nok, mode = "wb")
read_file(txt_nok_mode_wb) |> str_trunc(200) |> str_view(use_escapes = TRUE)
#> [1] │ O*NET-SOC Code\tJob Zone\tDate\tDomain Source\r\n11-1011.00\t5\t07/2014\tAnalyst\r\n11-1011.03\t5\t08/2021\tAnalyst\r\n11-1021.00\t4\t07/2015\tAnalyst\r\n11-1031.00\t4\t06/2008\tAnalyst\r\n11-2011.00\t4\t08/2018\tAnalyst\r\n11-...
read_file(txt_nok_mode_wb) |> str_trunc(200) |> str_view()
#> [1] │ O*NET-SOC Code{\t}Job Zone{\t}Date{\t}Domain Source{\r}
#> │ 11-1011.00{\t}5{\t}07/2014{\t}Analyst{\r}
#> │ 11-1011.03{\t}5{\t}08/2021{\t}Analyst{\r}
#> │ 11-1021.00{\t}4{\t}07/2015{\t}Analyst{\r}
#> │ 11-1031.00{\t}4{\t}06/2008{\t}Analyst{\r}
#> │ 11-2011.00{\t}4{\t}08/2018{\t}Analyst{\r}
#> │ 11-...
read_lines(txt_nok_mode_wb)[1:5]
#> [1] "O*NET-SOC Code\tJob Zone\tDate\tDomain Source"
#> [2] "11-1011.00\t5\t07/2014\tAnalyst"
#> [3] "11-1011.03\t5\t08/2021\tAnalyst"
#> [4] "11-1021.00\t4\t07/2015\tAnalyst"
#> [5] "11-1031.00\t4\t06/2008\tAnalyst"
Created on 2023-06-10 with reprex v2.0.2
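If you would rather check for the problem programmatically than by eye, the comparison above can be approximated in base R by scanning the raw bytes for a \r\r\n sequence. This is only a minimal sketch: has_doubled_cr() is a hypothetical helper name, not part of any package, and the two small files are written locally to stand in for a good and a damaged download.

```r
# Hypothetical helper (not from any package): TRUE if the file contains
# at least one \r\r\n byte sequence, i.e. the damage described above.
has_doubled_cr <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  if (n < 3) return(FALSE)
  i <- seq_len(n - 2)
  any(bytes[i]     == as.raw(0x0d) &   # \r
      bytes[i + 1] == as.raw(0x0d) &   # \r
      bytes[i + 2] == as.raw(0x0a))    # \n
}

# Simulate both cases locally, no download involved.
ok_file  <- tempfile(fileext = ".txt")
bad_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\tb\r\n1\t2\r\n"), ok_file)      # intact \r\n endings
writeBin(charToRaw("a\tb\r\r\n1\t2\r\r\n"), bad_file) # damaged \r\r\n endings

has_doubled_cr(ok_file)
has_doubled_cr(bad_file)
```

Reading with readBin() avoids any text-mode translation, so the check sees exactly what is on disk.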
*) Unlike read_tsv(), read.table() and read_delim(delim = "\t") cope with this more or less gracefully: the former fails with an error, line 2 did not have 7 elements, so apparently those "empty" rows are not empty for read.table(); the latter treats them as blank rows and just skips them, i.e. the content gets parsed.
The crash itself is reproducible at my end too: the R session in RStudio gets terminated with an "R Session Aborted: R encountered a fatal error. The session was terminated." message, and when running readr::read_tsv() through Rscript, it segfaults:
$ Rscript.exe -e 'sessionInfo(); download.file("https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt", destfile="jobzones.txt"); readr::read_tsv("jobzones.txt")'
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
...
trying URL 'https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt'
Content type 'text/plain' length 28693 bytes (28 KB)
==================================================
downloaded 28 KB
Segmentation fault
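Downloading again with mode = "wb" is the real fix. If that is not an option and you only have an already damaged file, one possible workaround is to collapse \r\r\n back to \r\n at the byte level before handing the file to a parser. The sketch below assumes a plain text file without embedded NUL bytes; fix_doubled_cr() is a hypothetical helper name, and the sample file is written locally for illustration.

```r
# Hypothetical helper: rewrite a file with every \r\r\n collapsed to \r\n.
# Assumes a text file without embedded NUL bytes (rawToChar() would fail otherwise).
fix_doubled_cr <- function(path_in, path_out = path_in) {
  bytes <- readBin(path_in, what = "raw", n = file.size(path_in))
  txt   <- rawToChar(bytes)
  fixed <- gsub("\r\r\n", "\r\n", txt, fixed = TRUE, useBytes = TRUE)
  writeBin(charToRaw(fixed), path_out)  # binary write, no further translation
  invisible(path_out)
}

# Local stand-in for a file downloaded without mode = "wb".
damaged  <- tempfile(fileext = ".txt")
repaired <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\tb\r\r\n1\t2\r\r\n"), damaged)

fix_doubled_cr(damaged, repaired)
repaired_bytes <- readBin(repaired, what = "raw", n = file.size(repaired))
```

Writing to a separate path_out keeps the damaged original around for comparison.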