2

I am using iTextSharp's PdfReader to read a PDF file that has 18 pages, but every time I increment the page number, extraction starts again from the beginning of the PDF instead of reading just that particular page. If I set "x" to the pdfReader.NumberOfPages value, it reads only the last page. I would like to read each page individually and add its text to my list of strings, s. I am also going through a folder and reading each PDF file, but I am testing with just one at first.

List<string> s = new List<string>();
while (z < filePaths.Count())
{
    PdfReader pdfReader = new PdfReader(filePaths[z]); 
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int x = 1; x <= pdfReader.NumberOfPages; x++)
    {
        string currentText = "";                                
        currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);                        
        s.Add(currentText);
    }
    z++;
    pdfReader.Close();
}
Petter Hesselberg
AWooster
  • does it always read the first page only, except for the last page, or does it read everything from first to xth page each? the underlying workhorse method `ProcessContent(int pageNumber, E renderListener)` clearly should do what you intend... which version of ITextSharp do you use? – Cee McSharpface Dec 07 '16 at 20:35
  • using 5.5.10.0, it always starts at the first page and reads until the xth page – AWooster Dec 07 '16 at 20:51
  • just to make sure... do you expect `s` to contain all pages of all files, one page worth of text per list item, when the outer loop is finished? – Cee McSharpface Dec 07 '16 at 21:07
  • Yes, I am wanting to read the pdf page by page and insert each page of text as a list item. – AWooster Dec 07 '16 at 21:10

3 Answers

5

All previous answers are pretty close, i.e. you were correctly blaming it on some kind of state issue.

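The only part that was missing is that it is the strategy object that remembers its state. SimpleTextExtractionStrategy appends everything it renders to an internal buffer, and calling GetTextFromPage does not flush that buffer, so each call returns the text of every page processed so far with that instance.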

So the trick is to instantiate your strategy inside the loop:

for (int x = 1; x <= pdfReader.NumberOfPages; x++)
{
    // A fresh strategy per page, so no text carries over from earlier pages.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);
    s.Add(currentText);
}
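
To make the state issue concrete, here is a minimal sketch that runs without iTextSharp. `AccumulatingStrategy` and `Demo.GetTextFromPage` are hypothetical stand-ins written for this illustration, not iTextSharp types; they only model a strategy that appends to one buffer and never clears it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text extraction strategy that keeps
// appending to a single internal buffer and never clears it.
class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public void RenderText(string text) { buffer.Append(text); }

    public string GetResultantText() { return buffer.ToString(); }
}

static class Demo
{
    // Hypothetical stand-in for PdfTextExtractor.GetTextFromPage:
    // feeds one page into the strategy, then returns the strategy's whole buffer.
    public static string GetTextFromPage(string[] pages, int pageNumber, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pages[pageNumber - 1]);
        return strategy.GetResultantText();
    }

    // One strategy shared by all pages: each result also contains the earlier pages.
    public static List<string> ReadWithSharedStrategy(string[] pages)
    {
        var shared = new AccumulatingStrategy();
        var result = new List<string>();
        for (int x = 1; x <= pages.Length; x++)
        {
            result.Add(GetTextFromPage(pages, x, shared));
        }
        return result;
    }

    // A new strategy per page: each result contains only that page.
    public static List<string> ReadWithFreshStrategy(string[] pages)
    {
        var result = new List<string>();
        for (int x = 1; x <= pages.Length; x++)
        {
            result.Add(GetTextFromPage(pages, x, new AccumulatingStrategy()));
        }
        return result;
    }

    static void Main()
    {
        string[] pages = { "A", "B", "C" };
        Console.WriteLine(string.Join(" | ", ReadWithSharedStrategy(pages))); // A | AB | ABC
        Console.WriteLine(string.Join(" | ", ReadWithFreshStrategy(pages)));  // A | B | C
    }
}
```

The shared-strategy output matches what the question describes: every call returns the text from the first processed page up to the current one.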
blagae
1

Got it to work by removing the strategy argument from the call PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy):

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        List<string> filePaths = new List<string>();
        filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-EN-ONT (1364).pdf");
        filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-UN-NOR (1364).pdf");
        filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-UN-SOU (1364).pdf");
        List<string> results = doit(filePaths);
        string stall = "stall"; // placeholder line for a debugger breakpoint
    }

    private static List<string> doit(List<string> filePaths)
    {
        List<string> s = new List<string>();
        int z = 0;
        while (z < filePaths.Count)
        {
            PdfReader pdfReader = new PdfReader(filePaths[z]);
            for (int x = 1; x <= pdfReader.NumberOfPages; x++)
            {
                // No strategy argument is passed here; that is the only change.
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x);
                s.Add(currentText);
            }
            z++;
            pdfReader.Close();
        }
        return s;
    }
}
blaze_125
  • that invokes the default, `LocationTextExtractionStrategy`, instead of `SimpleTextExtractionStrategy`. Obviously, the latter does not handle page boundaries properly - no idea if by definition, or because of a bug. – Cee McSharpface Dec 07 '16 at 21:30
  • It makes it work regardless. It went from not-working, to working and that's the only change I made to make it work. At this point, I'd lean towards a bug. Bruno is often on here answering iText related question, hopefully he'll see this one and chip in. – blaze_125 Dec 07 '16 at 21:31
  • that's why I upvoted, although the discussion below my answer led to the same conclusion :) let's hope someone from iText reads this, they might explain, or fix. – Cee McSharpface Dec 07 '16 at 21:34
  • 1
    Funny enough... it looks like whenever a third parameter is passed, it stops working. I just tried it again by setting `strategy` to `LocationTextExtractionStrategy` and I'm back to the original behavior posted by the op. – blaze_125 Dec 07 '16 at 21:36
  • This will do the trick in this instance, because iText will create a strategy internally for each page individually (i.e. per `GetTextFromPage` call). See http://stackoverflow.com/a/41038493/2065017 – blagae Dec 08 '16 at 14:58
0

I suspect a reader state issue. Try opening the PdfReader once before the loop to get the page count. Store the page count in a variable. Use that variable as the upper bound for the loop. Then in the loop, instantiate a new PdfReader for every page, dispose it after each iteration.

EDIT: It turned out that the text extraction strategy is the culprit: it retains state between calls. Always instantiate a new SimpleTextExtractionStrategy before each call to GetTextFromPage, or omit the strategy parameter, in which case a new instance of the default ITextExtractionStrategy implementation is created internally for each call.
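
Putting the EDIT into code, a minimal sketch against iTextSharp 5.x might look like the following. The class and method names (`PdfPageText`, `ReadPages`) are hypothetical, chosen for this example; the iTextSharp calls are the ones already used in the question.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfPageText
{
    // Hypothetical helper: returns one string per page of the PDF at 'path'.
    public static List<string> ReadPages(string path)
    {
        var pages = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // New strategy for every page, so nothing accumulates across pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Close the reader even if extraction throws.
            reader.Close();
        }
        return pages;
    }
}
```

For the folder scenario from the question, calling `s.AddRange(PdfPageText.ReadPages(filePath))` once per file would collect one list item per page across all files.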

Cee McSharpface
  • I tried the following and still got the same result as above: for (int x = 1; x <= count; x++) { PdfReader newReader = new PdfReader(filePaths[z]); string currentText = ""; currentText = PdfTextExtractor.GetTextFromPage(newReader, x, strategy); s.Add(currentText); newReader.Dispose(); } – AWooster Dec 07 '16 at 20:55
  • weird. we should make sure that the page dictionary of that PDF isn't broken ... can you try with any other PDF file? and your remark that it works when you set `x` to the number of pages: what happens, if you set it to, let's say, 7? – Cee McSharpface Dec 07 '16 at 20:57
  • I could just use another list of strings and grab the last index of "s", but I am afraid it may not contain all the data from the pdf since string lengths are limited and I am not sure of how many pages may be in any one of these pdfs – AWooster Dec 07 '16 at 20:58
  • Let me clarify, if I set x to 18, the number of pages in this pdf, it only reads the 18th page, not the entire pdf. If I set it to 7, it reads the 7th page and then on the next loop it reads the 7th and 8th pages, and so on... – AWooster Dec 07 '16 at 21:00
  • hard to believe. how do you know? debugger, or output? flaw in output code? if not, [documentation](http://developers.itextpdf.com/reference/com.itextpdf.text.pdf.parser.PdfTextExtractor) would support that your approach is correct and you have a broken PDF or found a bug. – Cee McSharpface Dec 07 '16 at 21:04
  • I am debugging and looking at the currentText variable, I tried a different pdf and still the same issue – AWooster Dec 07 '16 at 21:07
  • then it has to be a reader state issue, otherwise how could it possibly fetch the first-to-fetch correctly, if it is the 18 or the 7? also, there is no overload of that function which would take a from...to page number. try omitting the extraction strategy argument, does that help? – Cee McSharpface Dec 07 '16 at 21:10
  • 1
    That made it work. I had tried this earlier but was getting the same result, I think calling the PDFReader and disposing it each time I read a page fixed the issue – AWooster Dec 07 '16 at 21:15