I integrated iText 7 into my project via NuGet, and at first everything seemed fine: the text extracted from PDFs was accurate. Then I noticed that it appears to read the 2nd and 3rd pages of the PDF together and merge them line by line, sometimes even character by character. I want the 2nd page read on its own, then the 3rd page, with the results kept separate rather than merged. Here's my code:

        public static string ExtractTextFromPdf(string path)
        {
            using (var pdfDocument = new PdfDocument(new PdfReader(path)))
            {
                var stringBuilder = new StringBuilder();
                var strategy = new LocationTextExtractionStrategy();
                for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
                {
                    string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
                    stringBuilder.Append(text);
                }
                return stringBuilder.ToString();
            }
        }

This problem is not specific to one PDF. The reading order comes out wrong on every PDF I supply: lines are sometimes merged, and the text sometimes skips forward or back. Does anyone have recommendations to improve accuracy, or a different PDF reading program that works better? I tried to find Adobe IFilter, but all the links seem to be dead.
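
One thing that may be worth testing, stated as an assumption rather than a confirmed diagnosis: the code above passes the same `LocationTextExtractionStrategy` instance to every page. If that instance keeps the text chunks it has already received, each later page would be sorted by position together with the earlier ones, which would match the merged output described. A minimal sketch that creates a fresh strategy for each page is below; the class name `PdfTextHelper` and the per-page line break are illustrative choices, not part of the original code.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical wrapper class, named here only so the sketch compiles on its own.
public static class PdfTextHelper
{
    public static string ExtractTextFromPdf(string path)
    {
        using (var pdfDocument = new PdfDocument(new PdfReader(path)))
        {
            var stringBuilder = new StringBuilder();
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
            {
                // Assumption: a strategy instance accumulates chunks across calls,
                // so a new one is created for every page instead of being reused.
                var strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);

                // AppendLine keeps the text of consecutive pages from running together.
                stringBuilder.AppendLine(text);
            }
            return stringBuilder.ToString();
        }
    }
}
```

If the output is still out of order with one strategy per page, the cause probably lies in how the PDF itself positions its text, and this change alone would not address it.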

Darkhydro
