
All of the code in this question is from the script called "LASSO code (Version for Antony)" in my GitHub repo for this project. You can run it on the folder called "last 40" to verify my claim that it does run on small sets of datasets. If you really feel like going the extra mile, message me here and I'll share a zipped 10k-scale folder of datasets via OneDrive or Google Drive (whichever you prefer), so you can also verify that the same script doesn't work on folders of that size.

This is absolutely going to drive me mad, I swear. I have been using the lapply function below without issue for a week now, and starting several hours ago it is giving me this error:

> datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
Error in checkForRemoteErrors(val) : 
  7 nodes produced errors; first error: could not find function "fread" 

Here is the rest of the script I am working with up until this line (after the lines I used to load all of the libraries I utilize):

# these 2 lines together create a simple character list of 
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/12th & 13th 10k"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)

# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)


# sort both of the list of file names so that they are in the proper order
my_order = DS_names_list |> 
  # split apart the numbers, convert them to numeric 
  strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)

DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]

# this line reads all of the data in each of the csv files 
# using the name of each store in the list we just created
CL <- makeCluster(detectCores() - 2L)
clusterExport(CL, c('paths_list'))
library(data.table)
system.time( datasets <- parLapply(CL, paths_list, fread) )

After looking up the documentation for the 3rd time today, I am thinking of trying:

system.time( datasets <- parLapply(CL, paths_list, fun = fread) )

Will that work??

P.S. Here are all of the libraries I load as the first thing I do:

# load all necessary packages
library(plyr)
library(dplyr)
library(tidyverse)
library(readr)
library(stringi)
library(purrr)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)

Also, I have already tried the following and none worked:

datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
datasets <- parLapply(CL, paths_list, function(i) {fread[i]})
datasets <- parLapply(CL, paths_list, function(i) {fread[[i]]})

datasets <- parLapply(CL, paths_list, \(ds) 
                      {fread(ds)})

system.time( datasets <- lapply(paths_list, fread) )

And when I run that last one, datasets <- lapply(paths_list, fread), I get the same error. This is exactly the original version that ran successfully at the beginning of last week; I only switched to the parallel version because the datasets folder I am importing has 260,000 CSV datasets in it. So two versions that have each worked dozens of times suddenly stopped working today!

wibeasley
Marlen
  • What happens if you qualify the function with its package (ie, replace `fread()` with `data.table::fread()`)? – wibeasley Jan 08 '23 at 17:45
  • Any chance it's related to this? https://stackoverflow.com/q/18035711/1082435 – wibeasley Jan 08 '23 at 17:47
  • @wibeasley ahh, good call m8, I will try that as soon as a new idea I am trying on just 10k datasets instead of the full 260k finishes running. I made a mistake when typing the original post which made it seem like the function failing was the alteration I am running right now, namely `system.time( datasets <- parLapply(CL, paths_list, fun = fread) )`, but the one that is actually failing is `system.time( datasets <- parLapply(CL, paths_list, fread) )` – Marlen Jan 08 '23 at 17:52
  • @wibeasley as for your second suggestion, I am not sure I fully understand the answer to that question, however, I have been using the exact same code without any issues for about 5 days now over a dozen times. And, even stranger, when I tried it on just 100 data sets in the datasets object instead of 10,000 or 260,000 an hour ago, that still works. It just fails when scaling up! I used the same script in two different RStudio windows seconds apart. I am speechless honestly, I thought coding is objective lol – Marlen Jan 08 '23 at 17:55
  • @wibeasley ... welp, I just finally got the following error after waiting for like 10 minutes: `system.time( datasets <- parLapply(CL, paths_list, fun = fread) )` gave `Error in checkForRemoteErrors(val) : one node produced an error: Input is empty or only contains BOM or terminal control characters`, `Timing stopped at: 1392 895.7 1710`. So, now I am trying your suggestion of using: `system.time( datasets <- parApply(CL, paths_list, data.table::fread()) )`. I will let you know if it works, fingers crossed! – Marlen Jan 08 '23 at 18:11
  • @wibeasley sorry man, it did not work: `system.time( datasets <- parApply(CL, paths_list, data.table::fread()) )` gave `Error in match.fun(FUN) : argument "FUN" is missing, with no default`, `Timing stopped at: 0.04 0.01 0.06` – Marlen Jan 08 '23 at 18:20
  • I haven't used the parallel apply functions much, but I remember that it sometimes takes effort to make sure each worker [environment](https://adv-r.hadley.nz/environments.html) is loaded correctly & consistently. That's related to the link (in my second comment) – wibeasley Jan 08 '23 at 18:39
  • I'm losing track a little of the versions. I misspoke: I think it should be `data.table::fread` (without the parentheses). – wibeasley Jan 08 '23 at 18:40
  • If you make your example reproducible (eg, don't use datasets in a local directory), it will be easier for others to experiment on their own machines. But I know your example is tougher than most for this. – wibeasley Jan 08 '23 at 18:42
  • @wibeasley good point in your last comment, I was literally seeing red when typing this question originally, so I forgot to add in a link to my GitHub repository with the code that has a subset with only 40 datasets you can run for yourself to verify that it indeed does run just fine. – Marlen Jan 08 '23 at 18:46
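
A minimal sketch of the worker-environment point raised in the comments above, assuming the default PSOCK cluster that `makeCluster()` creates: each worker is a fresh R session, so a package attached with `library()` in the main session is not attached on the workers, and an anonymous function that calls a bare `fread()` can fail there with `could not find function "fread"`. The folder `toy-csvs` and the three small CSV files are made up for illustration and are not part of the original script.

```r
library(parallel)

# Toy input: three small CSV files in a temporary folder (made up for this sketch)
toy_dir <- file.path(tempdir(), "toy-csvs")
dir.create(toy_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(x = 1:5, y = k),
            file.path(toy_dir, paste0("1-1-", k, ".csv")),
            row.names = FALSE)
}
toy_paths <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)

CL <- makeCluster(2L)

# Option 1: qualify the call with its namespace, so the worker does not
# need data.table to be attached
datasets_a <- parLapply(CL, toy_paths, function(p) data.table::fread(file = p))

# Option 2: attach data.table on every worker, so a bare fread() resolves there
invisible(clusterEvalQ(CL, library(data.table)))
datasets_b <- parLapply(CL, toy_paths, function(p) fread(file = p))

stopCluster(CL)
```

Either approach should make the lookup independent of what happens to be attached in the main session. The qualified call also sidesteps masking between packages that export functions with the same name.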

1 Answer


It hasn't failed yet on my Windows desktop with 20k files (I copied and pasted your 40 files a bunch of times). It has run 5 times, and I restarted the R session and RStudio each time.

It's too bad that the problem arises non-deterministically, but that's part of the parallel-computation game. See if this stripped-down example runs consistently for you.

Notice I'm avoiding library() to eliminate naming collisions caused by packages with identically-named functions. Also, I closed the cluster connection at the end.

# Enumerate files
paths_list <- 
  "~/Documents/delete-me/EER-Research-Project-main/20k" |> 
  list.files(full.names = TRUE, recursive = TRUE)

# Establish cluster
CL <- parallel::makeCluster(parallel::detectCores() - 2L)
parallel::clusterExport(CL, c('paths_list'))

# Read files
system.time({
  datasets <- parallel::parLapply(CL, paths_list, data.table::fread)
})

# Stop cluster
parallel::stopCluster(CL)

#>    user  system elapsed 
#>    7.09    1.22  101.93 
wibeasley
  • WOW, maybe there is a god and he doesn't hold me being a ginger against me after all! It works again, hallelujah. Do you have a PayPal or a Venmo kind sir? I am broke, but I'd like to send you a few bucks – Marlen Jan 08 '23 at 19:27
  • That's a nice offer, but instead take some grad student to lunch after you graduate. I'm glad it works on your Windows machine. I didn't test it on any of mine, and I remember there are backend differences between parallel's implementation on Windows & Linux. – wibeasley Jan 08 '23 at 20:08
  • actually, I appear to have spoken too soon although I assure you this is not your fault. It ran on 10, 40, 100, and 1,000 datasets, but not on 10,000. – Marlen Jan 08 '23 at 23:08
  • Bummer. I simplified things in response. See the edited version. – wibeasley Jan 09 '23 at 02:09
  • @wibeasley it is running again now, I was actually able to get it to work using mainly your suggestions and a bit of tinkering. The final version now is this: `CL <- makeCluster(detectCores() - 3L)`, `clusterExport(CL, c('paths_list'))`, `system.time( datasets <- parLapply(cl = CL, X = paths_list, fun = data.table::fread))`. But, I am going to add in the `parallel::` suggestions as well right now because I have been pushed to the point of paranoia by circumstances! – Marlen Jan 10 '23 at 08:33
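
The `Input is empty or only contains BOM or terminal control characters` error quoted in the earlier comments is raised by `fread` for a single input, yet it aborts the whole `parLapply()` call. Whether one unreadable file among the 10,000 is what breaks the larger runs here is only a guess. A sketch of how such files could be located, assuming a hypothetical wrapper `safe_fread()` and a made-up toy folder in which a path that does not exist stands in for any file `fread` cannot read:

```r
library(parallel)

# Toy input (made up): two readable CSV files plus one path that does not exist
toy_dir <- file.path(tempdir(), "toy-csvs-check")
dir.create(toy_dir, showWarnings = FALSE)
good_paths <- file.path(toy_dir, c("1-1-1.csv", "1-1-2.csv"))
for (p in good_paths) write.csv(data.frame(x = 1:3), p, row.names = FALSE)
toy_paths <- c(good_paths, file.path(toy_dir, "missing.csv"))

# Hypothetical helper: return the error object instead of stopping the worker
safe_fread <- function(p) {
  tryCatch(data.table::fread(file = p), error = function(e) e)
}

CL <- makeCluster(2L)
results <- parLapply(CL, toy_paths, safe_fread)
stopCluster(CL)

# Separate the files that failed from the ones that were read
failed    <- vapply(results, inherits, logical(1), what = "error")
bad_paths <- toy_paths[failed]
datasets  <- results[!failed]
```

With this pattern the run finishes even when some inputs fail, and `bad_paths` lists the files to inspect by hand afterwards.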