I am currently using RStudio version 3.4.3 and trying to analyze a French-language PDF document with the `tm` package.

My problem is that even though I specify the language of the document with the command `my_pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = "C:/Users/lo/Desktop/Eau/Catalogs.pdf", language = "fr"))`, some words come out truncated. For example, I get "nourric" instead of "nourrice" and "descript" instead of "description".

Do you have any idea how I can solve this?

zx8754
h.ibn
  • I think the problem is not related to the French language; this is something that can happen when reading a PDF (open a PDF and manually copy the text, and you'll often see that some words are cut off). – nyr1o Mar 07 '18 at 11:43
  • Try reading the PDF with `pdftools::pdf_text()` and see if you have the same problem. – phiver Mar 07 '18 at 13:36
  • Thank you very much for your help. The terms are now displayed correctly using `pdf_text()`. But when I try the text-cleaning commands, for example `docs <- pdf_text("C:/Users/lo/Desktop/Eau/Catalogues.pdf")`, `docs <- tm_map(docs, removeNumbers)`, `docs <- tm_map(docs, removeWords, stopwords("french"))` – h.ibn Mar 07 '18 at 16:49
  • it shows me the following error: `Error in UseMethod("tm_map", x): no applicable method for 'tm_map' applied to an object of class "character"` – h.ibn Mar 07 '18 at 16:50
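
The error in the last comment suggests that `tm_map` was given the plain character vector returned by `pdf_text()` rather than a corpus. A minimal sketch of one possible fix, assuming the `tm` package is installed, is to wrap the text in a corpus before cleaning it. The `pages` vector below is hypothetical stand-in text, not the contents of the original PDF; in practice it would be the result of `pdf_text()`.

```r
library(tm)

# Hypothetical stand-in for the character vector returned by
# pdftools::pdf_text(), which holds one string per PDF page.
pages <- c("La nourrice 2018 et la description du produit",
           "Catalogue 42 : une description de la nourrice")

# tm_map() works on corpus objects, so wrap the character vector first.
corpus <- VCorpus(VectorSource(pages))

# Lower-case before removing stop words, since the stop-word list is lower-case.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text of each page back out as a character vector.
cleaned <- vapply(
  seq_len(length(corpus)),
  function(i) paste(content(corpus[[i]]), collapse = " "),
  character(1)
)
print(cleaned)
```

`VCorpus(VectorSource(...))` treats each element of the vector as one document, so each PDF page becomes its own document here. Whether that is the right granularity depends on the analysis.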
