ISSUE:

I am trying to extract multiple keywords and their surrounding text from a set of PDF documents in English, Spanish, and French. For the English PDFs it works like a charm, but it fails for Spanish and French terms that contain accented letters (e.g., é, ê, ô). Code for reading the English PDFs:

library(textreadr)
library(pdftools)
library(pdfsearch)

keyword = c('biology') # define searched keyword 

dirct <- "~/Documents/pdfs" # define directory

### keyword search
result <- keyword_directory(dirct, 
                          keyword = keyword,
                          surround_lines = 0, full_names = TRUE)

Running the same code for terms with letters specific to French or Spanish (e.g., é, ê, ô) does not yield any results.

WHAT I HAVE TRIED:

I noticed that the accented letters are converted into a different representation:

keyword = c('biología') # keyword 

"biolog\303\255a" # the keyword as it is listed under Values

"biolog<U+00E1>" # the Unicode form the *keyword_directory* function converts the keyword to

I have tried passing each of these representations as the keyword instead, but neither yielded any results:

keyword = c('biolog\303\255a')
keyword = c('biolog<U+00E1>')

I'm stuck with the keyword_directory function because it extracts both the keywords and their surrounding text from the PDFs.

LauraGE

1 Answer


Maybe you can try the following replacements (see "Hex code point" on the page http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A9&mode=char):

1. é can be replaced by "\U00E9" (if you type "\U00E9" in R, you will get "é");

2. ê can be replaced by "\U00EA";

3. etc.
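As a minimal sketch of the idea above, the keywords can be built from Unicode escapes so that the script does not depend on the encoding of the source file. The variable names are only illustrative. Note that the í in "biología" is U+00ED (U+00E1 is á), and that the octal form "\303\255" from the question is simply the two UTF-8 bytes of that same character.

```r
# Build the accented keywords from Unicode escapes.
kw_es <- "biolog\u00EDa"   # í is U+00ED
kw_fr <- "\u00E9cole"      # é is U+00E9

# Strings written with \u escapes are stored as UTF-8.
Encoding(kw_es)
nchar(kw_es)               # counts characters, not bytes

# The octal form from the question holds the same bytes; once it is
# declared as UTF-8 it is the same string as the \u version.
kw_bytes <- "biolog\303\255a"
Encoding(kw_bytes) <- "UTF-8"
identical(kw_es, kw_bytes)
```

A string such as kw_es can then be passed as the keyword argument of keyword_directory. Whether that alone makes the search succeed has not been tested here.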

I do not have access to your PDFs, so I can't test it. If you could provide links to some of the PDFs you are searching, that would be useful.
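If keyword_directory still returns nothing, one way to narrow the problem down is to search the extracted text directly. The sketch below is an assumption-laden fallback, not a replacement for pdfsearch: find_keyword_lines is a hypothetical helper, and the pages vector is a stand-in for what pdftools::pdf_text() returns (one string per page).

```r
# Hypothetical helper: return the lines in `pages` that contain `keyword`.
# `pages` is a character vector with one element per page.
find_keyword_lines <- function(pages, keyword) {
  # Split every page into lines and flatten them into one vector.
  lines <- unlist(strsplit(enc2utf8(pages), "\n", fixed = TRUE))
  # Literal (non-regex) match on UTF-8 text.
  lines[grepl(enc2utf8(keyword), lines, fixed = TRUE)]
}

# Stand-in for pdf_text() output so the sketch runs without a PDF.
pages <- c("Introducci\u00F3n\nLa biolog\u00EDa marina",
           "Resumen\nOtra l\u00EDnea")
find_keyword_lines(pages, "biolog\u00EDa")
```

If this finds the term in the real page text while keyword_directory does not, the encoding of the keyword is the likely culprit. If neither finds it, the PDF may store the accent as a separate combining character, in which case normalising both the text and the keyword (for example with stringi::stri_trans_nfc()) might help. That part is a guess.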

Emmanuel Hamel