ISSUE:
I am trying to extract multiple keywords and their surrounding text from a set of PDF documents in English, Spanish, and French. For English PDFs it works like a charm, but not for Spanish and French terms that contain accented letters (e.g., é, ê, ô). Code for reading English PDFs:
library(textreadr)
library(pdftools)
library(pdfsearch)
keyword = c('biology') # define searched keyword
dirct <- "~/Documents/pdfs" # define directory
### keyword search
result <- keyword_directory(dirct,
                            keyword = keyword,
                            surround_lines = 0, full_names = TRUE)
Running the same code for terms with letters specific to French or Spanish (e.g., é, ê, ô) does not yield any results.
WHAT I HAVE TRIED:
I noticed that the accented letters are converted into different Unicode representations:
keyword = c('biología') # keyword
"biolog\303\255a" # how the keyword is listed under Values
"biolog<U+00E1>" # the Unicode form the *keyword_directory* function converts the keyword to
I have tried using each of these Unicode forms as the search keyword, but neither yielded any results:
keyword = c('biolog\303\255a') # first attempt: octal-escaped bytes
keyword = c('biolog<U+00E1>') # second attempt: <U+....> notation typed literally
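For reference, here is a minimal base R sketch of how these representations relate to each other. It only inspects the keyword string and does not touch keyword_directory or any PDF. The variable names `octal` and `literal` are made up for illustration. As far as I can tell, í is code point U+00ED and `\303\255` is its two-byte UTF-8 encoding, whereas a `<U+....>` sequence typed into a string is just plain ASCII text.

```r
# Build the keyword from an explicit escape so the result does not depend
# on the encoding of the script file.
keyword <- paste0("biolog", "\u00ed", "a")   # 'biología'

# Code point of the accented letter (7th character): 237 decimal = 0xED
utf8ToInt(substr(keyword, 7, 7))

# Raw UTF-8 bytes of the keyword; the accented letter occupies bytes c3 ad
charToRaw(enc2utf8(keyword))

# The octal-escaped form holds the same bytes (\303 = 0xC3, \255 = 0xAD).
# Marking it as UTF-8 makes R treat it as the same string.
octal <- "biolog\303\255a"
Encoding(octal) <- "UTF-8"
identical(enc2utf8(keyword), octal)

# Typing the <U+....> notation literally produces a different, longer
# string made of ASCII characters, so it cannot match the accented word.
literal <- "biolog<U+00ED>a"
nchar(keyword)   # 8 characters
nchar(literal)   # 15 characters
```

If this is right, the octal form is only how R prints the UTF-8 bytes, so the mismatch would have to come from how the keyword or the extracted PDF text is encoded at search time, but that part is an assumption I have not verified.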
I'm stuck with the keyword_directory function because it extracts both the keywords and their surrounding text from the PDFs.