
I am trying to do some text mining of a PDF by searching for certain keywords.

This is my code:

library(pdftools)
library(tidyverse)
library(pdfsearch)

UC_text <- pdf_text("https://wilmar-iframe.todayir.com/attachment/20190411162436345449392_en.pdf") 

result <- keyword_search(UC_text, 
                         keyword = c('SUBSTANTIAL SHAREHOLDERS'),
                         path = TRUE, surround_lines = 1)

However, I get an error message saying the filename is too long. How can I get around this issue?

Jane

1 Answer


According to the explanation in the CRAN manual for pdfsearch, you can pass the PDF link directly to keyword_search(). Doing it that way, I do not see the error message you describe; instead, I get the following result.

result <- keyword_search("https://wilmar-iframe.todayir.com/attachment/20190411162436345449392_en.pdf", 
                         keyword = c('SUBSTANTIAL SHAREHOLDERS'),
                         path = TRUE, surround_lines = 1)

  keyword                  page_num line_num line_text token_text
  <chr>                       <int>    <int> <list>    <list>    
1 SUBSTANTIAL SHAREHOLDERS       49     2010 <chr [3]> <list [3]>
jazzurro
  • In this case, won't I be unable to parse multiple PDFs if I pass the link directly to keyword_search()? – Jane Feb 16 '20 at 08:36
  • And even though 'substantial shareholders' appears multiple times, why did the results show only one record? – Jane Feb 16 '20 at 08:37
  • @Jane I think you can parse multiple files if you use a loop (see the sketch after these comments). As for the result, I have no clue. That document has more than 200 pages, and I am afraid I do not have time to look through all of them. What you want to do is check how the text is imported into R. In my experience, dealing with PDF text data takes a lot of effort before you have clean text to work with. – jazzurro Feb 16 '20 at 09:01
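
As a minimal sketch of the loop approach mentioned in the last comment: the pdf_urls vector and the source column are illustrative names I am introducing here, and you would replace the vector's contents with your own links; each element is assumed to be a direct PDF link like the one in the question.

library(pdfsearch)
library(dplyr)

# Hypothetical vector of PDF links -- replace with your own documents
pdf_urls <- c("https://wilmar-iframe.todayir.com/attachment/20190411162436345449392_en.pdf")

# Run keyword_search() on each link and tag every hit with its source document
results <- lapply(pdf_urls, function(u) {
  res <- keyword_search(u,
                        keyword = c('SUBSTANTIAL SHAREHOLDERS'),
                        path = TRUE, surround_lines = 1)
  if (nrow(res) > 0) res$source <- u
  res
})

# Combine the per-document result tibbles into one table
all_results <- bind_rows(results)

For PDFs that are already saved locally, pdfsearch also provides keyword_directory(), which searches every PDF in a folder in a single call and may be simpler than looping over files yourself.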