
I'm running a for loop over 13K PDF files. For each file it reads the PDF, pre-processes the text, finds similarities, and writes the result to a txt file. However, after about 760 PDF files the R session aborts. What could be the reason?

  1. I reduced the code to a minimal example (below), but it still produces the same issue.
  2. I tried increasing `memory.limit()`, but memory does not seem to be the issue.
  3. I tried deleting hidden files in the folder, such as Thumbs.db, but the same issue appears again.
  4. I tried splitting the 13K PDF files into 4 folders of about 3.3K each, and I got the same error message: Error in file(file, ifelse(append, "a", "w")) : can not open the connection. In addition: There are 50 warnings(). The R session then aborted.
  5. When I run the loop over only pdf_folder[759:762], it reads those files perfectly fine without aborting.

folder_path <- "C: ...."
## get vector with all pdf names
pdf_folder <- list.files(folder_path)

## for loop over all pdf documents
for(s in seq_along(pdf_folder)){
 # for(s in 1:2){
 tryCatch({


   ## choose one pdf document from vector of strings
   pdf_document_name <- pdf_folder[s]

   ## read pdf_document pdf into data.frame
   pdf <- read_pdf(paste0(folder_path,"/",pdf_document_name))

   print(s)

   rm(pdf)

 ## first end trycatch block
}, error = function(e){print(paste("Error: PDF Document not used: ",pdf_document_name, sep =""))}
 ) ## end of trycatch

} ## end of for loop

# Error: 

Error in file(file, ifelse(append, "a", "w")) : can not open the connection. In addition: There are 50 warnings()

The expected outcome is to read and pre-process all PDF documents in `folder_path`.
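As a minimal sketch of one way to rule out bad file names before the loop starts: the helper below lists only `.pdf` files with their full paths, drops temporary lock files whose names start with `~$`, and keeps only paths that exist. The name `list_pdf_files` and the demo folder are hypothetical and not part of the original code. This only checks the file list; it is not known to fix the abort.

```r
# Hypothetical helper (not from the original post): list only usable PDF paths.
list_pdf_files <- function(folder_path) {
  # Keep only *.pdf files and return full paths, so no paste0() is needed later
  files <- list.files(folder_path, pattern = "\\.pdf$",
                      full.names = TRUE, ignore.case = TRUE)
  # Drop temporary lock files ("~$...") created while a file is open elsewhere
  files <- files[!startsWith(basename(files), "~$")]
  # Keep only paths that actually exist on disk
  files[file.exists(files)]
}

# Demo on a throwaway folder with one real name and two names to be skipped
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.pdf", "~$a.pdf", "Thumbs.db"))))
print(basename(list_pdf_files(demo_dir)))
```

With a list built this way, the loop can pass each element straight to the reader without building the path by hand.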

  • What messages are on the console when R aborts? Is there something unique about the 760th and 761st files that make crashing at the point predictable? In other words, if you change from `1:length(pdf_folder)` to `pdf_folder[759:762]`, does it still crash on the same files? What is your `read_pdf` function doing? – r2evans Jun 15 '19 at 10:46
  • The read_pdf function reads PDF documents into a data.frame. Messages on the console: Error in file(file, ifelse(append, "a", "w")) : can not open the connection. In addition: There are 50 warnings(). I checked the 760th and 761st files and found nothing special; they are ordinary PDF documents. – Bakai Baiazbekov Jun 15 '19 at 11:12
  • Can you please put the text (verbatim in a code block) in the question? It's very relevant to the question, and comments can be missed/skipped/hidden. BTW: *"can not open the connection"* sounds like a filename does not exist. Perhaps you need `list.files(folder.path, full.names=TRUE)` instead (and then no need for `paste0(...)` within `read_pdf`). I wonder if somehow either the `paste0` is munging things just enough, or there's some other path-component being munged/missed. – r2evans Jun 15 '19 at 11:17
  • The error came from the line `pdf <- read_pdf(paste0(folder_path,"/",pdf_document_name))`; instead of `"/"` I put `"\\"`. – Bakai Baiazbekov Jun 15 '19 at 12:02
  • Now, after trying `full.names=TRUE` and `read_pdf(pdf_document_name)`, it aborts at 760 again. – Bakai Baiazbekov Jun 15 '19 at 12:35
  • Back to my original suggestions. What happens when you use `pdf_folder[759:762]`? – r2evans Jun 15 '19 at 12:45
  • +1 @r2evans I had this same issue with .png files (`recursive=T, full.names=T`). It turned out that the file(s) were being appended with ‘~$’ because they were opened by another program, and list.files() was listing the copied file name (‘~$’). Worth a check at the fail point. – OctoCatKnows Jun 15 '19 at 12:51
  • When I run `pdf_folder[759:762]`, it reads perfectly fine without abort. – Bakai Baiazbekov Jun 15 '19 at 13:02
  • What I did (sorry, I know it's frustrating) was add a logic test: `df$exist <- apply(df, 2, file.exists)`, which gave me a quick way to spot the couple of files with nonexistent names. – OctoCatKnows Jun 15 '19 at 13:07
  • Bakai, you have some form of resource-exhaustion, it seems. It could be memory or perhaps (less likely) file-descriptors/connections. Is `read_pdf` something you can share? Is it too big? I find it curious that you capture the output as `pdf <- ...` but over-write it on the next pass in the `for` loop, discarding the return value. Are you doing something else there, perhaps `rbind`ing the results into some larger `frame`? – r2evans Jun 15 '19 at 13:10
  • In addition, I tried to divide 13 K pdf files into 4 folders, each (3,3K), and I got same error message `Error in file(file, ifelse(append, "a", "w")) : can not open the connection. In addition: There are 50 warnings()` and R session aborted. – Bakai Baiazbekov Jun 16 '19 at 09:06
  • Moreover, I tried to run same code on R console (not Rstudio) it runs without abort. However, it fails in Rstudio. – Bakai Baiazbekov Jun 16 '19 at 12:43
  • Maybe it's keeping the files open and it ran out of file handlers. – Adan Cortes Jun 16 '19 at 12:50
  • On every iteration of pdf_document I remove the pdf, `rm(pdf)`. – Bakai Baiazbekov Jun 24 '19 at 08:30
  • Has this been fixed? I have a very similar issue using `readtext::readtext()` on roughly 120 small PDFs. – Dr. Fabian Habersack Mar 07 '23 at 17:08
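Several comments above suspect resource exhaustion, such as running out of file handles. A minimal sketch of one way to limit that on the writing side: open the output file once, write every result through that single connection, close it when the function exits, and record which inputs failed. The names `process_all`, `process_one` and `fake_reader` are hypothetical, with `fake_reader` standing in for the real `read_pdf` plus pre-processing step. Whether this avoids the abort in RStudio is not established in this thread.

```r
# Hypothetical sketch (not from the original post): one output connection,
# closed on exit, with per-file error handling.
process_all <- function(files, process_one, out_file) {
  con <- file(out_file, open = "w")   # opened once instead of once per file
  on.exit(close(con), add = TRUE)     # closed even if the loop stops early
  failed <- character(0)
  for (f in files) {
    result <- tryCatch(process_one(f), error = function(e) {
      message("Skipped ", basename(f), ": ", conditionMessage(e))
      NULL                            # NULL marks this file as failed
    })
    if (is.null(result)) {
      failed <- c(failed, f)
    } else {
      writeLines(as.character(result), con)
    }
  }
  invisible(failed)                   # return the inputs that could not be processed
}

# Demo with a stand-in reader that fails on one input
fake_reader <- function(f) {
  if (f == "bad.pdf") stop("unreadable")
  toupper(f)
}
out_file <- tempfile(fileext = ".txt")
failed <- process_all(c("one.pdf", "bad.pdf", "two.pdf"), fake_reader, out_file)
print(failed)
print(readLines(out_file))
```

The returned vector of failed inputs makes it possible to rerun only those files afterwards instead of restarting the whole batch.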

0 Answers