My supervisor wants me to convert .pdf files to .txt files to be processed by a keyword extraction algorithm. The .pdf files are scanned court documents. She essentially wants a folder called court_document with subdirectories, each named with a 13-character case ID. I received about 500 .pdf files with file names of the form "caseid_docketnumber_date_documentdescription.pdf", e.g. "1-20-cr-30164_d2_5_23_2020_complaint.pdf". She also wants each .txt file to be saved as "docketnumber_date_documentdescription.txt", e.g. "d2_5_23_2020_complaint.txt". The .pdf files are saved in my working directory court_document. The desired outcome is a root directory called court_document with 500 subdirectories, each containing the .txt files.
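So, for the example file above, the desired layout would look like this:

court_document/
  1-20-cr-30164/
    d2_5_23_2020_complaint.txt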
I approached the problem as follows:
# Packages ---------------------------------------------------------------
library(tesseract)
library(pdftools)
library(stringr)
library(magrittr)
library(purrr)
library(bench)
# Function to convert .pdf to .txt ----------------------------------------
pdf_convert_txt <- function(pdf) {
  # Case id
  # The pdf file names are such that the first 13 characters are the case id
  case_id <- str_sub(
    string = pdf,
    start = 1L,
    end = 13L
  )
  # File path for writing the .txt file to the subdirectory
  txt_file_path <- paste0(
    # Subdirectory
    paste0(case_id, "/"),
    # Drop the case id plus separator (first 14 chars) and the .pdf extension (last 4 chars)
    str_sub(
      string = pdf,
      start = 15L,
      end = -5L
    ),
    # File extension
    ".txt"
  )
  # Create the subdirectory using the case id as its name
  if (!dir.exists(paths = case_id)) dir.create(path = case_id)
  # Convert pdf to png
  # This function creates one .png file per pdf page in the current working dir
  # It also returns a character vector of .png file names
  pdf_convert(
    pdf = pdf,
    format = "png",
    dpi = 200
  ) %>%
    # Pass the character vector of .png file names to tesseract::ocr()
    # This function returns plain text by default
    ocr(image = .) %>%
    # Concatenate and save the plain text to the subdirectory created above
    cat(file = txt_file_path)
  # Remove all png files in the current working directory
  file.remove(
    list.files(pattern = "\\.png$")
  )
}
# Apply pdf_convert_txt() to all .pdf files in current working dir -------------------
map(
  # All .pdf files in the current working directory court_document
  .x = list.files(pattern = "\\.pdf$"),
  .f = pdf_convert_txt
)
This solution works, but profiling reveals that ocr(image = .) really slows down the code. A typical court document has at least 50 pages, so there are 50 .png files from which text has to be extracted. That one line alone takes about 72,000 ms to run on my Intel MacBook Pro (2020). I just have so many .pdf files that I'm wondering whether there's any way to break through this bottleneck, or whether I need to switch to other tools. Any advice and suggestions will be greatly appreciated.
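For what it's worth, one direction I've been considering is parallelizing across documents, since each file is independent and the OCR step looks CPU-bound. Below is a minimal sketch using parallel::mclapply(); the core count and the caveat about temporary files are my own assumptions, not something I've benchmarked:

library(parallel)

# Rough sketch only: fork one worker per document (forking is available on macOS).
# For this to be safe, the cleanup step in pdf_convert_txt() would first need to
# delete only the .png files returned by pdf_convert(), so that concurrent
# workers don't remove each other's intermediate images.
pdf_files <- list.files(pattern = "\\.pdf$")
mclapply(
  X = pdf_files,
  FUN = pdf_convert_txt,
  mc.cores = detectCores() - 1L
)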