I am running 64-bit R/RStudio on 64-bit Windows 10. The PC has 16 GB of RAM and 8 cores.

RStudio crashes at around 1.6 to 1.7 GB of memory utilization while reading a larger dataset.

So I'm trying to use the parallel package to spread the read across multiple cores, but I'm making a mistake somewhere.

library("data.table")
library("lubridate")
library("parallel")
library("foreach")
library("doParallel")

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl, cores = detectCores() - 2)

files = list.files(pattern="public")
myfiles = do.call(rbind, lapply(files, function(x) fread(x, colClasses=c(ID="character")))) 

I don't have much experience with parallel processing.

Can you please let me know where I am going wrong?

Update:

R has no problem creating an 8 GB object in memory.

bigint <- integer(2^32 / 2)

Still not sure what is limiting the reading of the data.

Update 2:

I made a diagnostic report. These are the errors that I'm getting.

24 Jan 2019 23:10:33 [rdesktop] ERROR system error 231 (All pipe instances are busy); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:11:47 [rdesktop] ERROR system error 231 (All pipe instances are busy); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:11:47 [rdesktop] ERROR system error 231 (All pipe instances are busy); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:13:39 [rdesktop] ERROR system error 231 (All pipe instances are busy); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:13:40 [rdesktop] ERROR system error 231 (All pipe instances are busy); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:13:41 [rdesktop] ERROR system error 232 (The pipe is being closed); OCCURRED AT: void rstudio::core::http::AsyncClient<SocketService>::handleWrite(const rstudio_boost::system::error_code&) [with SocketService = rstudio_boost::asio::windows::basic_stream_handle<>] C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/AsyncClient.hpp:342; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:13:42 [rdesktop] ERROR system error 2 (The system cannot find the file specified); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
24 Jan 2019 23:13:42 [rdesktop] ERROR system error 2 (The system cannot find the file specified); OCCURRED AT: virtual void rstudio::core::http::NamedPipeAsyncClient::connectAndWriteRequest() C:/Users/Administrator/rstudio/src/cpp/core/include/core/http/NamedPipeAsyncClient.hpp:84; LOGGED FROM: void rstudio::desktop::NetworkReply::onError(const rstudio::core::Error&) C:\Users\Administrator\rstudio\src\cpp\desktop\DesktopNetworkReply.cpp:288
Prometheus
  • @Dave2e I just updated the question. Still confused on what might be the issue. – Prometheus Jan 25 '19 at 02:49
  • `fread()` is already parallelized, so I don't think reading multiple files in parallel would be any faster than reading them one by one. Reading them one by one might also reduce the memory footprint. – F. Privé Jan 25 '19 at 04:10
  • @F.Privé right. Well, I think that I have been able to load bigger files in the past on the same computer. And as I mentioned in the update, R has no problem keeping +10GB objects in memory. I just don't have any clue why the IDE is crashing at around 1.6GB memory usage when I use fread. – Prometheus Jan 25 '19 at 15:13
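
The one-by-one approach suggested in the comment above could be sketched roughly as follows. This is only an illustration, not the asker's code: `read_files_sequential` is a hypothetical helper name, and `rbindlist()` is swapped in for `do.call(rbind, ...)` on the assumption that binding the list in a single call is lighter on memory.

```r
library(data.table)

# Hypothetical helper: read the files one at a time (fread() already uses
# several threads internally) and bind them together in a single call.
read_files_sequential <- function(files) {
  parts <- lapply(files, fread, colClasses = c(ID = "character"))
  rbindlist(parts)
}
```

With the question's setup this would be called as `myfiles <- read_files_sequential(list.files(pattern = "public"))`, with no cluster registered at all.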

1 Answer

If a non-parallel process uses R GB of RAM, a parallel process with C cores will need approximately R*C GB of RAM, because each worker is a separate R process holding its own data. I suggest increasing the number of cores gradually, beginning with 2 cores and monitoring your RAM usage.
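
A minimal sketch of that advice, assuming the files are CSVs with an `ID` column as in the question. The names `read_one`, `read_files_parallel` and `n_workers` are hypothetical, and limiting `fread()` to one thread per worker is my own assumption to avoid oversubscribing the cores, not something the question requires.

```r
library(data.table)
library(parallel)

# Hypothetical helper: read a single file, keeping ID as character.
# nThread = 1L keeps each worker from spawning its own thread pool.
read_one <- function(path) {
  data.table::fread(path, colClasses = c(ID = "character"), nThread = 1L)
}

# Read the files on a small cluster. Each worker is a separate R process,
# so peak memory grows with n_workers; start at 2 and raise it gradually.
read_files_parallel <- function(files, n_workers = 2L) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)   # always release the workers
  parts <- parLapply(cl, files, read_one)
  rbindlist(parts)
}
```

For example, `myfiles <- read_files_parallel(list.files(pattern = "public"), n_workers = 2L)`, watching RAM in Task Manager before trying 3 or 4 workers.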

zabala