I am extracting text from a PDF and have an issue with the same text being returned from sequential pages. I have written a few PDF parsers using iTextSharp and have just ported the following code from iTextSharp to iText7, on the flawed assumption that this was only an iTextSharp issue:

        var pdfDocument = new PdfDocument(new PdfReader(@"C:\Temp\MyForm.pdf"));

        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            var strategy = new SimpleTextExtractionStrategy();
            var pdfPage = pdfDocument.GetPage(page);
            var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, strategy);

            // Process this page
            Console.WriteLine("PAGE {0}", page);
            Console.WriteLine(currentText);
        }

Is there something I'm missing here?

Bexbissell
  • Unfortunately you don't share the test PDF. One idea: iText text extraction by default ignores whether text is inside the page crop box or outside. Some PDFs have the content of multiple pages on the same content stream and only by different crop boxes select the content of the respective PDF page object. Probably that's the case for your PDFs. If it is, applying a filter to the crop box should fix the issue. If it is not, please share the PDF for analysis. – mkl Nov 20 '20 at 17:14
  • Thanks for the response, mkl. I'll have to investigate your filter/crop box approach (something I'm not familiar with). Here is the PDF (in the public domain BTW): [link](https://reports.adviserinfo.sec.gov/reports/ADV/285187/PDF/285187.pdf) – Bexbissell Nov 20 '20 at 17:20

1 Answer

Actually it is not the same text being returned from sequential pages. Instead you get

  • the text from page 1 when you extract page 1;
  • the text from pages 1 and 2 when you extract page 2;
  • the text from pages 1, 2, and 3 when you extract page 3;
  • ...

Often this happens for code that re-uses a text extraction strategy for multiple pages, because the strategy object accumulates the text of every page it processes. But that's not the case in your code; you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.
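For readers who land here because of the strategy re-use problem rather than the crop box one, here is a minimal sketch of the difference. It builds a small three-page PDF in memory with the iText 7 layout module, so it needs no input file; the class name `StrategyReuseDemo`, the helper methods, and the sample texts are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using iText.Layout;
using iText.Layout.Element;

public static class StrategyReuseDemo
{
    // Builds a three-page sample PDF in memory (hypothetical content).
    public static byte[] BuildSamplePdf()
    {
        var stream = new MemoryStream();
        var document = new Document(new PdfDocument(new PdfWriter(stream)));
        document.Add(new Paragraph("Alpha"));
        document.Add(new AreaBreak());            // start page 2
        document.Add(new Paragraph("Beta"));
        document.Add(new AreaBreak());            // start page 3
        document.Add(new Paragraph("Gamma"));
        document.Close();                         // also closes the PdfDocument and the stream
        return stream.ToArray();                  // ToArray still works on a closed MemoryStream
    }

    // Extracts the text page by page, either with one shared strategy
    // (the problematic pattern) or with a fresh strategy per page.
    public static List<string> Extract(byte[] pdfBytes, bool reuseStrategy)
    {
        var texts = new List<string>();
        var pdf = new PdfDocument(new PdfReader(new MemoryStream(pdfBytes)));
        ITextExtractionStrategy shared = new SimpleTextExtractionStrategy();

        for (int page = 1; page <= pdf.GetNumberOfPages(); page++)
        {
            ITextExtractionStrategy strategy =
                reuseStrategy ? shared : new SimpleTextExtractionStrategy();
            texts.Add(PdfTextExtractor.GetTextFromPage(pdf.GetPage(page), strategy));
        }

        pdf.Close();
        return texts;
    }

    public static void Main()
    {
        byte[] pdfBytes = BuildSamplePdf();

        Console.WriteLine("Shared strategy (text accumulates):");
        foreach (string text in Extract(pdfBytes, true))
            Console.WriteLine("  [{0}]", text);

        Console.WriteLine("Fresh strategy per page:");
        foreach (string text in Extract(pdfBytes, false))
            Console.WriteLine("  [{0}]", text);
    }
}
```

With the shared strategy, the text returned for page 2 should still contain the text of page 1, and so on; with a fresh strategy per page, each result should contain only that page's text.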

And indeed, each page of your document does contain the contents of all previous pages, too, merely outside its crop box. To extract only the text in the respective page crop box you have to filter, e.g. like this:

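using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

string SRC = @"285187.pdf";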

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();

It is unclear whether the PDF has been created like this by design or by error.
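If you want to check whether another PDF is affected in the same way, one rough approach is to compare, page by page, how much text extraction returns with and without the crop box filter. The following is only a sketch: the class name `CropBoxCheck` and the command-line argument handling are made up for illustration, and the default file name is simply the one used above.

```csharp
using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class CropBoxCheck
{
    public static void Main(string[] args)
    {
        // Assumption: the PDF path is passed as the first argument,
        // falling back to the file name used in this answer.
        string src = args.Length > 0 ? args[0] : "285187.pdf";
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));

        for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
        {
            PdfPage pdfPage = pdfDoc.GetPage(i);
            Rectangle cropBox = pdfPage.GetCropBox();

            // Everything in the page's content stream, regardless of position.
            string unfiltered = PdfTextExtractor.GetTextFromPage(
                pdfPage, new SimpleTextExtractionStrategy());

            // Only the text inside the crop box.
            var filters = new IEventFilter[] { new TextRegionEventFilter(cropBox) };
            var listener = new FilteredTextEventListener(
                new SimpleTextExtractionStrategy(), filters);
            string filtered = PdfTextExtractor.GetTextFromPage(pdfPage, listener);

            Console.WriteLine(
                "PAGE {0}: crop box x={1} y={2} w={3} h={4}, unfiltered={5} chars, filtered={6} chars",
                i, cropBox.GetX(), cropBox.GetY(), cropBox.GetWidth(), cropBox.GetHeight(),
                unfiltered.Length, filtered.Length);
        }

        pdfDoc.Close();
    }
}
```

A page whose unfiltered count is clearly larger than its filtered count has text outside its crop box. For a document like the one in the question, you would expect that difference to grow from page to page.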

mkl
  • Thanks a lot mkl that solved my issue. I need to read up on crop boxes. Only the SEC can answer your question! – Bexbissell Nov 20 '20 at 19:53
  • *"I need to read up on crop boxes"* - as a starter on the boxes read [here](https://stackoverflow.com/a/13240546/1729265). – mkl Nov 22 '20 at 14:38
  • "[...] this happens for code that re-uses a text extraction strategy for multiple pages". I was becoming crazy, and you saved me! Thanks! – Vitox Aug 18 '23 at 22:59