
I'm trying to download data from a website via a text file, then load it into R via read_tsv using the following code:

url_jobzones <-
"https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt"

jobzones_file_name <- "jobzones.txt"

download.file(url_jobzones, destfile= paste("data/EdExTrain", jobzones_file_name, sep="/"))

Those three lines work fine and the text file opens in Windows without problems, but running the following line of code results in a fatal error and crashes RStudio:

> jobzones <- read_tsv("data/EdExTrain/jobzones.txt")

Any suggestions?

For additional context, running similar code on other files from that website doesn't result in the error and works great:

url_skills <- "https://www.onetcenter.org/dl_files/database/db_27_3_text/Skills.txt"

skills_file_name <- "skills.txt"

download.file(url_skills, destfile= paste("data/KSAs", skills_file_name, sep="/"))

skills <- read_tsv("data/KSAs/skills.txt")

In fact, the same code runs without the error for the majority of the text files on that website (https://www.onetcenter.org/database.html#individual-files).

The only files that seem to make it crash are:

  • Job Zones
  • Basic Interests to RIASEC
  • Tools Used
  • Tasks to DWA
  • Sample of Reported Titles
  • Abilities to Work Activities

Any and all input would be most welcome. Happy to post the full code I'm using if that would be helpful (242 lines total, most of it is copy/paste format like you see above). Thank you in advance.

  • I downloaded the file into my `Downloads` folder and then `readr::read_tsv("/Users/gregort/Downloads/Job Zones.txt")` worked just fine. I'd suggest trying to isolate the problem a bit more - does it work with a manual download? Is there any error message? Is the problem only with `read_tsv` - do other functions like `read.table()` or `data.table::fread()` work on it? – Gregor Thomas Jun 09 '23 at 22:01
  • If you read the file directly without downloading does it work? `readr::read_tsv("https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt")` also works fine for me. – Gregor Thomas Jun 09 '23 at 22:02

1 Answer


I can't tell why it crashes instead of throwing an error (*), but without mode = "wb", download.file() on Windows modifies line endings during the text-mode transfer, and that can confuse parsers. This matters most for text files that already use \r\n (carriage return + line feed) line endings: those get converted to \r\r\n, so the number of rows doubles and every second row is empty. You can check downloaded files in an editor that displays non-printing characters well (e.g. Notepad++) and/or use print() / stringr::str_view() on a string read from the file to debug such cases:

library(stringr)
library(readr)

url_txt_nok <- "https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt"

dl_txt <- function(url_, ...){
  tmp_txt <- tempfile(fileext = ".txt")
  download.file(url_, destfile = tmp_txt, ...)
  tmp_txt
}

# download.file without mode = "wb", existing \r\n line endings are converted to \r\r\n
txt_nok_mode_def <- dl_txt(url_txt_nok)
# escape all non-ASCII chars to display all non-printing characters, including \n
read_file(txt_nok_mode_def) |> str_trunc(200) |> str_view(use_escapes = TRUE)
#> [1] │ O*NET-SOC Code\tJob Zone\tDate\tDomain Source\r\r\n11-1011.00\t5\t07/2014\tAnalyst\r\r\n11-1011.03\t5\t08/2021\tAnalyst\r\r\n11-1021.00\t4\t07/2015\tAnalyst\r\r\n11-1031.00\t4\t06/2008\tAnalyst\r\r\n11-2011.00\t4\t08/2018\tAnalyst...
read_file(txt_nok_mode_def) |> str_trunc(200) |> str_view()
#> [1] │ O*NET-SOC Code{\t}Job Zone{\t}Date{\t}Domain Source{\r\r}
#>     │ 11-1011.00{\t}5{\t}07/2014{\t}Analyst{\r\r}
#>     │ 11-1011.03{\t}5{\t}08/2021{\t}Analyst{\r\r}
#>     │ 11-1021.00{\t}4{\t}07/2015{\t}Analyst{\r\r}
#>     │ 11-1031.00{\t}4{\t}06/2008{\t}Analyst{\r\r}
#>     │ 11-2011.00{\t}4{\t}08/2018{\t}Analyst...
read_lines(txt_nok_mode_def)[1:5]
#> [1] "O*NET-SOC Code\tJob Zone\tDate\tDomain Source"
#> [2] ""                                             
#> [3] "\n11-1011.00\t5\t07/2014\tAnalyst"            
#> [4] ""                                             
#> [5] "\n11-1011.03\t5\t08/2021\tAnalyst"

# download.file with mode = "wb"
txt_nok_mode_wb  <- dl_txt(url_txt_nok, mode ="wb")
read_file(txt_nok_mode_wb) |> str_trunc(200) |> str_view(use_escapes = TRUE)
#> [1] │ O*NET-SOC Code\tJob Zone\tDate\tDomain Source\r\n11-1011.00\t5\t07/2014\tAnalyst\r\n11-1011.03\t5\t08/2021\tAnalyst\r\n11-1021.00\t4\t07/2015\tAnalyst\r\n11-1031.00\t4\t06/2008\tAnalyst\r\n11-2011.00\t4\t08/2018\tAnalyst\r\n11-...
read_file(txt_nok_mode_wb) |> str_trunc(200) |> str_view()
#> [1] │ O*NET-SOC Code{\t}Job Zone{\t}Date{\t}Domain Source{\r}
#>     │ 11-1011.00{\t}5{\t}07/2014{\t}Analyst{\r}
#>     │ 11-1011.03{\t}5{\t}08/2021{\t}Analyst{\r}
#>     │ 11-1021.00{\t}4{\t}07/2015{\t}Analyst{\r}
#>     │ 11-1031.00{\t}4{\t}06/2008{\t}Analyst{\r}
#>     │ 11-2011.00{\t}4{\t}08/2018{\t}Analyst{\r}
#>     │ 11-...
read_lines(txt_nok_mode_wb)[1:5]
#> [1] "O*NET-SOC Code\tJob Zone\tDate\tDomain Source"
#> [2] "11-1011.00\t5\t07/2014\tAnalyst"              
#> [3] "11-1011.03\t5\t08/2021\tAnalyst"              
#> [4] "11-1021.00\t4\t07/2015\tAnalyst"              
#> [5] "11-1031.00\t4\t06/2008\tAnalyst"

Created on 2023-06-10 with reprex v2.0.2
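
The same check can be done programmatically when there are many downloaded files to inspect. Below is a minimal base-R sketch; `has_doubled_cr()` is a hypothetical helper name (not part of any package), and the two temporary files only simulate the text-mode and binary-mode outcomes shown above rather than downloading anything:

```r
# Hypothetical helper: TRUE if the file contains the byte sequence \r\r\n
# (0x0d 0x0d 0x0a), which is what a text-mode download of a CRLF file leaves behind
has_doubled_cr <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  if (n < 3) return(FALSE)
  idx <- seq_len(n - 2)
  any(bytes[idx] == as.raw(0x0d) &
      bytes[idx + 1] == as.raw(0x0d) &
      bytes[idx + 2] == as.raw(0x0a))
}

# Simulated files: one with the doubled carriage returns, one with plain \r\n
bad  <- tempfile(fileext = ".txt")
good <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\tb\r\r\n1\t2\r\r\n"), bad)
writeBin(charToRaw("a\tb\r\n1\t2\r\n"), good)

has_doubled_cr(bad)
has_doubled_cr(good)
```

Running it over a folder of downloads, e.g. with `sapply(list.files("data", recursive = TRUE, full.names = TRUE), has_doubled_cr)`, should flag the files that need to be fetched again.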

*) Unlike read_tsv(), read.table() and read_delim(delim = "\t") cope with this more or less gracefully. The former fails with an error because line 2 did not have 7 elements, so apparently those "empty" rows are not empty for read.table(). The latter handles blank rows by simply skipping them, i.e. the content gets parsed.
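
Applied to the code in the question, the change amounts to passing mode = "wb" to download.file(). A minimal sketch, assuming readr 2.0 or later is installed; `fetch_tsv()` is a hypothetical wrapper name, and creating the destination directory is a convenience added here, not something the original code did:

```r
library(readr)

# Hypothetical wrapper: download in binary mode so the bytes are stored unchanged
# (no \r\n -> \r\r\n conversion on Windows), then parse the tab-separated file
fetch_tsv <- function(url, dest) {
  # Assumption: create the target directory if it does not exist yet
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  read_tsv(dest, show_col_types = FALSE)
}
```

With the names from the question this would be called as `jobzones <- fetch_tsv(url_jobzones, "data/EdExTrain/jobzones.txt")`. Since binary mode leaves files unchanged, it should be safe to use for the files that already worked as well.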

The crash itself is reproducible on my end too: the R session in RStudio gets terminated with the message "R Session Aborted. R encountered a fatal error. The session was terminated.", and when running readr::read_tsv() through Rscript, it segfaults:

$ Rscript.exe -e 'sessionInfo(); download.file("https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt", destfile="jobzones.txt"); readr::read_tsv("jobzones.txt")'
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
...
trying URL 'https://www.onetcenter.org/dl_files/database/db_27_3_text/Job%20Zones.txt'
Content type 'text/plain' length 28693 bytes (28 KB)
==================================================
downloaded 28 KB

Segmentation fault
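
Files that were already downloaded in text mode do not necessarily have to be fetched again. As a rough sketch, assuming the only damage is the \r\r\n sequence described above, the extra carriage return can be removed in place with base R; `fix_doubled_cr()` is a hypothetical helper name and the temporary file only simulates an affected download:

```r
# Hypothetical helper: rewrite a file in place, turning every \r\r\n back into \r\n
fix_doubled_cr <- function(path) {
  txt <- rawToChar(readBin(path, what = "raw", n = file.size(path)))
  # fixed = TRUE treats the pattern literally; useBytes = TRUE avoids re-encoding
  fixed <- gsub("\r\r\n", "\r\n", txt, fixed = TRUE, useBytes = TRUE)
  writeBin(charToRaw(fixed), path)
  invisible(path)
}

# Simulated text-mode download with doubled carriage returns
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\tb\r\r\n1\t2\r\r\n"), tmp)

fix_doubled_cr(tmp)
repaired <- readBin(tmp, what = "raw", n = file.size(tmp))
identical(repaired, charToRaw("a\tb\r\n1\t2\r\n"))
```

Downloading again with mode = "wb" remains the simpler route when the source is still available; the in-place repair is mainly useful for files that cannot easily be fetched a second time.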
margusl
  • I’d guess that this doesn’t actually cause a crash on OP’s computer and that they instead didn’t describe the problem accurately. – Konrad Rudolph Jun 10 '23 at 08:42
  • @KonradRudolph, the crash is actually reproducible (Windows 10, R4.2.3, rstudio 2023.03.1) – margusl Jun 10 '23 at 11:02