I have a large 100-page PDF file that contains several scanned documents concatenated together. I would like to split it into smaller PDF files, each containing exactly one document.

Is there a way to detect the start and end of each document within this PDF and perform the split automatically in R?

I have imported the file with pdftools::pdf_text, which gives me the text of all 100 pages, but I have no idea how to tell where a document starts and ends other than by checking manually.

zx8754
raph
    It sounds like you will need some form of document scraping. Does each document have page numbers or titles? Perhaps you could search the PDF text for these markers to help you separate the documents. – Spooked Apr 05 '23 at 13:02
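
The idea in the comment above could be sketched roughly as follows. This is only a minimal sketch under assumptions not stated in the question: it assumes the first page of every document contains some recognizable text (the hypothetical `start_pattern`, for example a title or "Page 1"), and that the pages actually have a text layer. For pure scans, pdftools::pdf_text may return empty strings, in which case an OCR step such as pdftools::pdf_ocr_text would probably be needed first. The helper names find_doc_ranges and split_pdf and the output file names are made up for illustration; qpdf::pdf_subset is used to write the pieces.

```r
# Sketch only: assumes each document's first page matches `start_pattern`
# (a hypothetical marker; adjust it to whatever your documents share).

# Given one text string per page, return the first and last page of each document.
find_doc_ranges <- function(pages, start_pattern) {
  stopifnot(length(pages) > 0)
  starts <- which(grepl(start_pattern, pages, ignore.case = TRUE))
  # Treat page 1 as a document start even if it does not match the pattern.
  if (length(starts) == 0 || starts[1] != 1) {
    starts <- c(1L, starts)
  }
  # Each document ends on the page before the next start; the last ends on the final page.
  ends <- c(starts[-1] - 1L, length(pages))
  data.frame(start = starts, end = ends)
}

# Read the PDF, locate the documents, and write one PDF per document.
# Requires the pdftools and qpdf packages; not called here.
split_pdf <- function(input, start_pattern, out_dir = ".") {
  pages  <- pdftools::pdf_text(input)  # one character string per page
  ranges <- find_doc_ranges(pages, start_pattern)
  for (i in seq_len(nrow(ranges))) {
    qpdf::pdf_subset(
      input,
      pages  = ranges$start[i]:ranges$end[i],
      output = file.path(out_dir, sprintf("document_%02d.pdf", i))
    )
  }
  ranges
}

# Hypothetical usage (file name and pattern are placeholders):
# split_pdf("big_file.pdf", start_pattern = "^Invoice No")
```

Keeping the page-range detection separate from the file writing makes it possible to check the detected ranges on the extracted text before any PDF is written. If no single pattern fits all documents, the same structure should still work with a different rule for finding the start pages.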

0 Answers