
I was using this code to extract data from my PDF:

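library(pdftools)  # pdf_text()
library(stringr)   # str_split(), str_trim(), str_split_fixed()

tx <- pdf_text("Name.pdf")                            # one string per page
tx2 <- unlist(str_split(tx, "[\\r\\n]+"))             # split the text into lines
tx3 <- str_split_fixed(str_trim(tx2), "\\s{2,}", 5)   # split each line into at most 5 columns
write.csv(tx3, file = "Path\\ds1.csv")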

But this splits the text at every end of line. I want to split it after every paragraph instead. Is there another split function I can use to get the data paragraph by paragraph?
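One possible approach, sketched here under a strong assumption: if the extracted text keeps a blank line between paragraphs, you can split on blank lines instead of on every line break. Whether that holds depends on the layout of the PDF, so check the raw text first. The helper name `split_paragraphs` is made up for this example, and it uses only base R so it can be tried without a PDF.

```r
# Hypothetical helper: split extracted text into paragraphs.
# Assumes paragraphs are separated by at least one blank line,
# which is not guaranteed for every PDF.
split_paragraphs <- function(pages) {
  # Normalise Windows (\r\n) and old Mac (\r) line endings to \n
  normalized <- gsub("\r\n?", "\n", pages)
  # A line break, an optionally whitespace-only line, then another line break
  parts <- unlist(strsplit(normalized, "\n[ \t]*\n\\s*", perl = TRUE))
  # Join the lines inside each paragraph with single spaces
  parts <- trimws(gsub("\\s+", " ", parts))
  # Drop empty pieces (for example from leading blank lines)
  parts[nzchar(parts)]
}

# Usage with the file from the question (needs pdftools and the PDF):
# paragraphs <- split_paragraphs(pdftools::pdf_text("Name.pdf"))
# write.csv(data.frame(paragraph = paragraphs), file = "Path\\ds1.csv")
```

Because the text comes back one string per page, a paragraph that continues across a page break ends up as two pieces with this sketch.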

  • Can you share one of the PDFs? I think there are a couple of ways, but it's hard to guess which one will work without an example. – JBGruber Sep 19 '19 at 13:35
  • It's a confidential document, so I won't be able to share it. But for example, there are 10 subheadings in a document and I want to extract the information only under subheading 3. – Parul Batra Sep 19 '19 at 15:56
  • Then I don't know how to help you. You might look into the function `pdftools::pdf_data()` and see if you can work it out. – JBGruber Sep 19 '19 at 15:58
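To illustrate the `pdf_data()` suggestion, here is a minimal sketch, not a tested solution for any particular document. `pdftools::pdf_data()` returns one data frame per page with a row per word, including the columns `x`, `y` and `text`. The idea is to rebuild lines from the `y` coordinate and start a new paragraph wherever the vertical gap between two lines is clearly larger than the typical line spacing. The helper name `paragraphs_from_words` and the `gap_factor` threshold are assumptions made up for this example.

```r
# Hypothetical helper: group the word boxes of ONE page into paragraphs.
# `words` is assumed to have the columns x, y and text, like one element
# of the list returned by pdftools::pdf_data().
paragraphs_from_words <- function(words, gap_factor = 1.5) {
  # Order the words top to bottom, then left to right
  words <- words[order(words$y, words$x), ]
  # Simplification: words with the same y value form one line
  line_y <- sort(unique(words$y))
  line_text <- vapply(
    line_y,
    function(y) paste(words$text[words$y == y], collapse = " "),
    character(1)
  )
  # Vertical distance between consecutive lines
  gaps <- diff(line_y)
  # A gap well above the median line spacing starts a new paragraph
  new_par <- c(TRUE, gaps > gap_factor * median(gaps))
  # Paste the lines of each paragraph together
  unname(vapply(split(line_text, cumsum(new_par)), paste, character(1), collapse = " "))
}

# Usage with the file from the question (needs pdftools and the PDF):
# page1 <- pdftools::pdf_data("Name.pdf")[[1]]
# paragraphs_from_words(page1)
```

In real documents, words on the same line can have slightly different `y` values (mixed font sizes, superscripts), so you may need to round `y` or adjust `gap_factor` by trial and error.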
