
My supervisor wants me to convert .pdf files to .txt files to be processed by a keyword extraction algorithm. The .pdf files are scanned court documents. She essentially wants a folder called court_document with subdirectories each named a 13-character case ID. I received about 500 .pdf files with file names "caseid_docketnumber_date_documentdescription.pdf", e.g. "1-20-cr-30164_d2_5_23_2020_complaint.pdf". She also wants each .txt file to be saved as "docketnumber_date_documentdescription.txt", e.g. "d2_5_23_2020_complaint.txt". The .pdf files are saved in my working directory court_document. The desired outcome is a root directory called court_document with 500 subdirectories each containing .txt files. I approached the problem as follows:

# Packages  ---------------------------------------------------------------
library(tesseract)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(bench)

# Function to convert .pdf to .txt ----------------------------------------
pdf_convert_txt <- function(pdf) {

  # Case id
  # The pdf file names are such that the first 13 characters are the case id's
  case_id <- str_sub(
    string = pdf,
    start = 1L,
    end = 13L
  )
  # File path for writing .txt file to subdirectory
  txt_file_path <- paste0(
    # Subdirectory
    paste0(case_id, "/"),
    # Drop the case id plus trailing underscore (first 14 chars) and the .pdf extension (last 4 chars)
    str_sub(
      string = pdf,
      start = 15L,
      end = -5L
    ),
    # File extension
    ".txt"
  )

  # Create subdirectory using case id as its name
  if (!dir.exists(paths = case_id)) dir.create(path = case_id)

  # Convert pdf to png
  # This function creates one .png file per pdf page in current working dir
  # It also returns a character vector of .png file names
  pdf_convert(
    pdf = pdf,
    format = "png",
    dpi = 200
  ) %>%
    # Pass the character vector of .png file names to tesseract::ocr()
    # This function returns plain text by default
    ocr(image = .) %>%
    # Concatenate and save plain text .txt file to subdirectory created above
    cat(file = txt_file_path)

  # Remove all png files in current working directory
  file.remove(
    list.files(pattern = "\\.png$")
  )
}

# Apply pdf_convert_txt() to all .pdf files in current working dir -------------
# walk() is used instead of map() since the function is called for its side effects
walk(
  # All .pdf files in current working directory court_document
  .x = list.files(pattern = "\\.pdf$"),
  .f = pdf_convert_txt
)

This solution works, but profiling reveals that `ocr(image = .)` really slows down the code. A typical court document has at least 50 pages, so at least 50 .png files from which text must be extracted. This one line alone takes about 72,000 ms to run on my 2020 Intel MacBook Pro. I just have so many .pdf files, so I'm wondering if there's any way to break through this bottleneck, or whether I need to switch to other tools. Any advice and suggestions will be greatly appreciated.

Yang Wu
    ocr is inherently slow. you could speed up the process by using the package `furrr` and use `plan` and `future_map` to run in parallel. See [here](https://furrr.futureverse.org/) for more info. – phiver Sep 24 '21 at 09:51
  • Yes. This may be helpful; I’m going to look into ways to implement this. I’ll update the question once I figure it out. Thanks for the pointer. – Yang Wu Sep 24 '21 at 12:44

1 Answer


Following phiver's suggestion and some experimenting of my own, I was able to cut the run time of the following chunk by about 40% for my typical 50-page pdf, even before using multisession:

  pdf_convert(
    pdf = pdf,
    format = "png",
    dpi = 80
  ) %>%
    ocr(image = .) %>%
    cat(file = txt_file_path)

I did so by reducing the resolution (the `dpi` argument) when converting from .pdf to .png. Fortunately, the type of .pdf files I am working with does not require high resolution for the OCR engine to pick up characters from the images. Lastly, in order to use multisession (`future::plan()` + `furrr::future_map()`), I took the following chunk outside of the function:

  file.remove(
    list.files(pattern = "\\.png$")
  )

Since I am running parallel processes, I needed to take this chunk out of the function; otherwise a single process would remove all .png files in the working directory, including those still needed by the other processes.
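Putting the pieces together, the revised driver might look like the following sketch. It assumes the modified `pdf_convert_txt()` no longer deletes .png files itself, and that the default `pdf_convert()` output names (derived from each input file's name) do not collide across workers:

```r
# Sketch of the parallel driver, assuming pdf_convert_txt() no longer
# removes intermediate .png files itself
library(future)
library(furrr)

# Launch one background R session per available core
plan(multisession)

# future_walk() is the side-effect analogue of future_map()
future_walk(
  .x = list.files(pattern = "\\.pdf$"),
  .f = pdf_convert_txt
)

# Clean up all intermediate .png files once every worker has finished
file.remove(list.files(pattern = "\\.png$"))

# Return to sequential processing
plan(sequential)
```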

Yang Wu
  • Could you use `tesseract::ocr` within `future_map()`? I get the error "Error: Detected a non-exportable reference (‘externalptr’ of class ‘tesseract’) in one of the globals (‘tesseract’ of class ‘function’) used in the future expression" – ava Nov 30 '22 at 23:05
  • @ava See if the following [link](https://future.futureverse.org/articles/future-4-non-exportable-objects.html) helps with trouble shooting for your case. – Yang Wu Dec 02 '22 at 01:48
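For the `externalptr` error ava describes, one workaround (a sketch, not tested against every furrr version) is to construct the tesseract engine inside the function that runs on the worker, so the non-exportable pointer is never shipped between sessions:

```r
library(future)
library(furrr)
plan(multisession)

ocr_one <- function(png) {
  # Build the engine inside the worker: tesseract engines wrap an
  # 'externalptr' that cannot be exported to another R session
  eng <- tesseract::tesseract("eng")
  tesseract::ocr(png, engine = eng)
}

texts <- future_map(list.files(pattern = "\\.png$"), ocr_one)
```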