I integrated iText 7 into my project via NuGet, and at first everything seemed fine: the text extracted from PDFs was accurate. Then I noticed that it appears to read the 2nd and 3rd pages of the PDF simultaneously, merging them line by line and sometimes even character by character. I want the 2nd page read on its own, then the 3rd page, with the results kept separate rather than merged. Here's my code:
    public static string ExtractTextFromPdf(string path)
    {
        using (var pdfDocument = new PdfDocument(new PdfReader(path)))
        {
            var stringBuilder = new StringBuilder();
            var strategy = new LocationTextExtractionStrategy();
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
            {
                string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
                stringBuilder.Append(text);
            }
            return stringBuilder.ToString();
        }
    }
This problem is not specific to the PDF I am using. The reading order comes out wrong on every PDF I supply: lines are sometimes merged, and the text skips forward or back. Does anyone have recommendations for improving accuracy, or for a different PDF reading library that works better? I tried to find Adobe IFilter, but all the links seem to be dead.
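One thing that may be worth trying, offered as a guess rather than a confirmed diagnosis: the code above creates a single LocationTextExtractionStrategy before the loop and reuses it for every page. As far as I understand it, that strategy collects text chunks by their position on the page and keeps them, so a reused instance would sort chunks from page 2 and page 3 together by coordinates, which would look like pages being merged line for line. The sketch below assumes that is the cause and creates a fresh strategy per page. The class name PdfTextHelper is hypothetical, and using AppendLine to separate pages is my own addition.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical wrapper class, added only so the sketch compiles on its own.
public static class PdfTextHelper
{
    public static string ExtractTextFromPdf(string path)
    {
        using (var pdfDocument = new PdfDocument(new PdfReader(path)))
        {
            var stringBuilder = new StringBuilder();
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
            {
                // Assumption: the strategy accumulates chunks across calls,
                // so a new instance is created for each page.
                var strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);

                // AppendLine keeps the last line of one page from running
                // into the first line of the next.
                stringBuilder.AppendLine(text);
            }
            return stringBuilder.ToString();
        }
    }
}
```

If the output is still out of order with a per-page strategy, the cause is more likely the layout of the PDF itself (for example, multi-column pages), and this change alone would not address that.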