Why am I getting duplicate pages extracted from iText7 C#?

Question

I am extracting text from a PDF and the same text is being returned for sequential pages. I have written a few PDF parsers using iTextSharp and have just ported the following code from iTextSharp to iText7, on the flawed assumption that this was only an iTextSharp issue:

        var pdfDocument = new PdfDocument(new PdfReader(@"C:\Temp\MyForm.pdf"));

        for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
        {
            var strategy = new SimpleTextExtractionStrategy();
            var pdfPage = pdfDocument.GetPage(page);
            var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, strategy);

            // Process this page
            Console.WriteLine("PAGE {0}", page);
            Console.WriteLine(currentText);
        }

Is there something I'm missing here?

Unfortunately you don't share the test PDF. One idea: iText text extraction by default ignores whether text is inside the page crop box or outside. Some PDFs have the content of multiple pages on the same content stream and only by different crop boxes select the content of the respective PDF page object. Probably that's the case for your PDFs. If it is, applying a filter to the crop box should fix the issue. If it is not, please share the PDF for analysis. — mkl, Nov 20 '20 at 17:14
Thanks for the response, mkl. I'll have to investigate your filter/crop box approach (something I'm not familiar with). Here is the PDF (in the public domain, BTW): [link](https://reports.adviserinfo.sec.gov/reports/ADV/285187/PDF/285187.pdf) — Bexbissell, Nov 20 '20 at 17:20

mkl · Accepted Answer · 2020-11-20T18:35:57.843

Actually it is not the same text being returned from sequential pages. Instead you get

the text from page 1 when you extract page 1;
the text from pages 1 and 2 when you extract page 2;
the text from pages 1, 2, and 3 when you extract page 3;
...

Often this happens for code that re-uses a text extraction strategy for multiple pages. But that's not the case in your code; you correctly create a new strategy object for each page. Thus the cause must be in the PDF itself.

And indeed, each page of your document does contain the contents of all previous pages, too, merely outside its crop box. To extract only the text in the respective page crop box you have to filter, e.g. like this:

string SRC = @"285187.pdf";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

Console.WriteLine("\n285187 Filtered\n============\n");

for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    var strategy = new SimpleTextExtractionStrategy();
    var pdfPage = pdfDoc.GetPage(i);

    var filter = new IEventFilter[1];
    filter[0] = new TextRegionEventFilter(pdfPage.GetCropBox());
    var filteredTextEventListener = new FilteredTextEventListener(strategy, filter);

    var currentText = PdfTextExtractor.GetTextFromPage(pdfPage, filteredTextEventListener);

    Console.WriteLine("PAGE {0}", i);
    Console.WriteLine(currentText);
}

pdfDoc.Close();

It is unclear whether the PDF has been created like this by design or by error.
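
For comparison, the strategy re-use pitfall mentioned above can be sketched roughly as follows. This is only an illustration, not code from the question: the class name `StrategyReuseDemo` and the path `sample.pdf` are placeholders I made up, and the sketch assumes an ordinary PDF whose pages do not carry the content of other pages.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class StrategyReuseDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // "sample.pdf" is a placeholder path, not the file from the question.
        PdfDocument pdfDoc = new PdfDocument(new PdfReader("sample.pdf"));

        // Pitfall: a single strategy instance shared by all pages.
        // SimpleTextExtractionStrategy keeps appending to its internal buffer,
        // so each call returns the text of every page processed so far.
        var sharedStrategy = new SimpleTextExtractionStrategy();
        for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
        {
            string accumulated = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), sharedStrategy);
            Console.WriteLine("PAGE {0} (shared strategy): {1} characters", i, accumulated.Length);
        }

        // Fix: create a fresh strategy object for each page.
        for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
        {
            var freshStrategy = new SimpleTextExtractionStrategy();
            string pageText = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), freshStrategy);
            Console.WriteLine("PAGE {0} (fresh strategy): {1} characters", i, pageText.Length);
        }

        pdfDoc.Close();
    }
}
```

With a shared strategy the reported length should grow from page to page, while the per-page strategy reports each page on its own. If the lengths still grow with a fresh strategy per page, as with your document, the duplication comes from the PDF content itself and the crop box filter above is the appropriate fix.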

Thanks a lot, mkl, that solved my issue. I need to read up on crop boxes. Only the SEC can answer your question! — Bexbissell, Nov 20 '20 at 19:53
*"I need to read up on crop boxes"* - as a starter on the boxes read [here](https://stackoverflow.com/a/13240546/1729265). — mkl, Nov 22 '20 at 14:38
"[...] this happens for code that re-uses a text extraction strategy for multiple pages". I was becoming crazy, and you saved me! Thanks! — Vitox, Aug 18 '23 at 22:59
