c# - PdfDocument.GetTextWithFormatting() does not take all pages

Question

I'm trying to read a large PDF file, but with this code

using BitMiracle.Docotic.Pdf;

PdfDocument pdf = new PdfDocument("document.pdf");
string document = pdf.GetTextWithFormatting();

the string document contains only the first 87 pages (of 174). Why does it take only the first half of the document?

EDIT: This is an evaluation mode restriction of the library. Are there any alternatives?

Have you tried looping through the pages and getting the text from each page? `pdf.Pages[i].GetText(options)` — Alexander Higgins, Jul 02 '17 at 23:14
*"This is an evaluation mode restrictions of the library. There are some alternatives?"* - Buy a license. Ok, actually you can get a *free time-limited license* (see @Bobrowsky's answer) to check your use case. Afterwards, though, buying a license is the obvious way to go if everything works as desired. — mkl, Jul 03 '17 at 10:14
I'm just trying to get a string from a PDF... I can't believe that there aren't any free and open source alternatives. — Gicminos, Jul 03 '17 at 10:19
There are free (with different meanings of "free") open source alternatives for getting strings from PDFs (asking for recommendations is off-topic here, though) but `GetTextWithFormatting` sounds like that method returns text plus formatting which in many alternatives requires a bit of programming to provide. — mkl, Jul 03 '17 at 12:09

score 2 · Accepted Answer · answered Jul 03 '17 at 05:24

The behavior you observe is because of evaluation mode restrictions. When used in trial mode, the library imposes the following restrictions:

Documents generated with the library contain an evaluation notice that is printed across each page.
For all existing documents only half of the pages get read by the library.

To evaluate the library without the evaluation mode restrictions you can get a free time-limited license on our site.

score 0 · Answer 2 · Alexander Higgins · 2017-07-03T00:47:03.340

You can try reading the text from each page:

using System;
using System.Text;
using BitMiracle.Docotic.Pdf;

// Collects the text of every page into one string.
StringBuilder sb = new StringBuilder();
var options = new PdfTextExtractionOptions
{
    WithFormatting = false,
    SkipInvisibleText = true
};
using (PdfDocument pdf = new PdfDocument("document.pdf"))
{
    int pageIndex = 1;
    foreach (var page in pdf.Pages)
    {
        // One console line per page confirms the loop visits every page.
        Console.WriteLine("Page {0}", pageIndex++);
        sb.AppendLine(page.GetText(options));
    }
}
string allText = sb.ToString();

After doing this you should see a line in your console for every page in the pdf.

It could be that the pages after 87 don't have text on them. For example, they could be images of scanned pages.

You can test this by trying to select, copy, and paste text from the PDF after page 87. If you can, then odds are it is a bug in the BitMiracle DLL.
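To see exactly which pages come back without text, it can help to check each page's extracted string instead of only the combined result. The following is a minimal sketch, not part of the library: `PageTextDiagnostics` and `FindBlankPages` are hypothetical names introduced here for illustration, and the helper assumes you have already collected the per-page strings (for example, the values returned by `page.GetText(options)` in the loop above) into a list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; it is not part of Docotic.Pdf.
public static class PageTextDiagnostics
{
    // Returns the 1-based numbers of pages whose extracted text is null,
    // empty, or whitespace only.
    public static List<int> FindBlankPages(IEnumerable<string> pageTexts)
    {
        var blankPages = new List<int>();
        int pageNumber = 1;
        foreach (string text in pageTexts)
        {
            if (string.IsNullOrWhiteSpace(text))
            {
                blankPages.Add(pageNumber);
            }
            pageNumber++;
        }
        return blankPages;
    }
}
```

If the reported page numbers form one unbroken run from some page to the end of the document, the missing text is more likely caused by a limit on how many pages are read than by individual scanned pages. Note also that `StringBuilder.AppendLine` appends a line terminator even for an empty string, so a page with no text still adds `Environment.NewLine.Length` characters (2 on Windows) to the combined result.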

edited Jul 03 '17 at 00:47

answered Jul 02 '17 at 23:17


Tried it and it stopped at page 87 – Gicminos Jul 02 '17 at 23:25
You only got the text up to page 87 or the loop stopped running at 87? If you run the modified code, what is the last pageIndex that prints? – Alexander Higgins Jul 02 '17 at 23:29
The loop checks all 147 pages, but after page 87 sb gains only a length of "2" per page. – Gicminos Jul 02 '17 at 23:44
I started thinking that the problem is in the pdf document... but I can open it without any problems... – Gicminos Jul 02 '17 at 23:54
Perhaps the pages after 87 don't have text on them. For example, they could be images of scanned pages. A good way to tell is to see if you can select, copy, and paste text from the PDF after page 87. If you can, then odds are it is a bug in the BitMiracle PDF library. – Alexander Higgins Jul 03 '17 at 00:45
