Error while doing OCR on a PDF in R

Question

I am trying to run OCR on a PDF in R and it gives me the errors below. After running the code, the "i.txt" file is generated, but the errors still appear.

pdftoppm version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
  -f <int>          : first page to print
  -l <int>          : last page to print
  -r <number>       : resolution, in DPI (default is 150)
  -mono             : generate a monochrome PBM file
  -gray             : generate a grayscale PGM file
  -freetype <string>: enable FreeType font rasterizer: yes, no
  -aa <string>      : enable font anti-aliasing: yes, no
  -aaVector <string>: enable vector anti-aliasing: yes, no
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
convert.exe: unable to open image '*.ppm': Invalid argument @ error/blob.c/OpenBlob/3146.
convert.exe: no images defined `D:/PDF_OCR_File/test.pdf.tif' @ error/convert.c/ConvertImageCommand/3275.
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
[[1]]
[1] FALSE

Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' had status 99 
2: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ",  :
  '"D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' execution failed with error code 99
3: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' had status 1 
4: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ",  :
  '"D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' execution failed with error code 1
5: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' had status 1 
6: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ",  :
  '"D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' execution failed with error code 1
7: In file.remove(paste0(i, ".tiff")) :
  cannot remove file 'D:/PDF_OCR_File/test.pdf.tiff', reason 'No such file or directory'

My working directory, set with setwd(), is "D:/PDF_OCR_File".

This is the code that produces the error:

dest <- "D:/PDF_OCR_File"
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})


myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)




lapply(myfiles, function(i){

  shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ", i, " -f 1 -l 2 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tiff" ))
})

I don't know where it is going wrong or what mistake I am making. Any suggestions would be helpful. Thanks.

Looks like the pdf-ppm command failed somehow, so the next command fails. Try to get the first command working in the terminal. You can use the `magick` package for OCR — Richard Telford, Sep 20 '17 at 11:04

score 0 · Answer 1 · 2017-11-06T20:08:17.703

I bet you are using this example for your code, huh? I found a lot of issues with that code, as well as some antiquated syntax.

The solution I've come up with is this:

  dest <- "C:\\users\\YOURNAME\\desktop"

  # strip spaces from the PDF file names so the shell commands do not break
  files <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
  sapply(files, FUN = function(a){
    file.rename(from = a, to = paste0(dirname(a), "/", gsub(" ", "", basename(a))))
  })

  # PDF to PPM: options first, then the PDF, then the output root
  files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
  lapply(files, function(i){
    shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 70 ", i, ".pdf", " ", i)))
  })

  # PPM to TIFF, then delete the PPM
  myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
  lapply(myppms, function(y){
    shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
    file.remove(paste0(y, ".ppm"))
  })

  # TIFF to text, then delete the TIFF
  mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
  lapply(mytiffs, function(z){
    shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
    file.remove(paste0(z, ".tif"))
  })

The first problem with the GitHub snippet is that the pdftoppm options are incomplete and in the wrong place for CMD to understand, which is why you are getting the help menu: as the usage text in your output shows, the options go before the PDF file name, not after it. "ocrbook" is the output file name (which is bad if you want to process more than one file), so you are going to get a PPM, PNG, or whatever file named something like "ocrbook-000001.png". The issue with function(i) in that block of code is that the later commands look for "originalpdfname.pdf.png" instead of the name of the file that was actually converted, "ocrbook-000001". I fixed that by adding a separate step that lists the converted image files and passes each one to the next command (z in my code).
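
As a rough sketch of the argument order and output naming described here: the helper names `pdftoppm_args` and `expected_ppm` are made up for illustration, the six-digit page suffix is taken from the "ocrbook-000001" name mentioned above, and the example assumes Windows-style (`cmd`) quoting. It only builds and prints the command, it does not run it.

```r
# Hypothetical helper (not part of the original answer): build pdftoppm arguments.
pdftoppm_args <- function(pdf, first = 1, last = 2, dpi = 600) {
  # Use the PDF's own name as the PPM root so each file gets distinct output.
  root <- tools::file_path_sans_ext(pdf)
  # Options first, then <PDF-file>, then <PPM-root>, as the usage text requires.
  c("-f", first, "-l", last, "-r", dpi,
    shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

# Assumed naming of the page images: "<root>-000001.ppm", "<root>-000002.ppm", ...
expected_ppm <- function(pdf, pages) {
  sprintf("%s-%06d.ppm", tools::file_path_sans_ext(pdf), as.integer(pages))
}

args <- pdftoppm_args("D:/PDF_OCR_File/test.pdf")
cat("pdftoppm", args, "\n")
print(expected_ppm("D:/PDF_OCR_File/test.pdf", 1:2))

# To actually run it (assumes pdftoppm is on the PATH):
# status <- system2("pdftoppm", args)
```

Deriving the root from the PDF name, rather than using a fixed "ocrbook", is what lets the later steps find the page images for each input file.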

Tesseract [is supposed to] read PNG files just fine, so there should be no need to use ImageMagick to convert from PPM to TIFF. Just use xPDF to convert the PDF to a PNG. However, in the GitHub example, the ImageMagick syntax is also outdated. "convert" apparently clashes with another CMD command, so it was changed in later versions to "magick". See here. For consistency, I used the converter in my example anyway.
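
If you do keep the ImageMagick step, the "magick" form of the call can be put together roughly like this. `magick_args` is a made-up helper name, the input path is only an example, and the quoting again assumes Windows `cmd`.

```r
# Hypothetical helper: arguments for "magick <input> <output>".
magick_args <- function(ppm) {
  # Same base name, with the .ppm extension swapped for .tif.
  tif <- paste0(tools::file_path_sans_ext(ppm), ".tif")
  c(shQuote(ppm, type = "cmd"), shQuote(tif, type = "cmd"))
}

margs <- magick_args("D:/PDF_OCR_File/test-000001.ppm")
cat("magick", margs, "\n")

# To actually run it (assumes ImageMagick 7 is on the PATH):
# status <- system2("magick", margs)
```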

Another thing about that code example is that tesseract defaults to English. This may be something that came with newer versions, but there is no longer any need to specify "-l eng". See here. "out" apparently is the exported txt file name (purely from observation), so you will need to strip the path down and use it in a function so that the output name mimics the original file name and is not overwritten each time the OCR runs on a new file.
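
A minimal sketch of deriving the output name from the input image, so each page gets its own text file. `tesseract_args` is a hypothetical helper and the paths are examples. Tesseract appends ".txt" to the output base itself, and the language flag is only added when you ask for it.

```r
# Hypothetical helper: arguments for "tesseract <image> <outputbase> [-l lang]".
tesseract_args <- function(image, lang = NULL) {
  # Output base = image path without extension; tesseract adds ".txt" to it.
  out_base <- tools::file_path_sans_ext(image)
  args <- c(shQuote(image, type = "cmd"), shQuote(out_base, type = "cmd"))
  # Only pass -l when a language is given; otherwise rely on the default.
  if (!is.null(lang)) args <- c(args, "-l", lang)
  args
}

targs <- tesseract_args("D:/PDF_OCR_File/test-000001.tif")
cat("tesseract", targs, "\n")
targs_eng <- tesseract_args("D:/PDF_OCR_File/test-000001.tif", lang = "eng")

# To actually run it (assumes tesseract is on the PATH):
# status <- system2("tesseract", targs)
```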
