iTextSharp.LGPLv2.Core get text from PDF into a string

Question

Recently our project was upgraded to iTextSharp.LGPLv2.Core v1.6.5. I had a method that extracted the text from a PDF file.

Back then I used this:

        if (File.Exists(pdf1Path))
        {
            var pdfReader = new PdfReader(pdf1Path);
            var text = new StringBuilder(); // accumulates the text of all pages
            string pdfText;
            string currentText;

            // Extract the text page by page (PDF page numbers are 1-based)
            for (int i = 1; i <= pdfReader.NumberOfPages; i++)
            {
                currentText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
                // Re-encode from the system default code page to UTF-8
                currentText =
                    Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8,
                        Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }

            pdfText = text.ToString();
        }

Now "PdfTextExtractor" is suddenly no longer recognized. Is there any other way to get this working? Note that I am not allowed to install any other libraries or packages.

I tried to use

using iTextSharp.text.pdf.parser;

but it is no longer recognized. When I try to download it separately, it conflicts with iTextSharp.LGPLv2.Core and I get an error.

Thanks for the help

Where did you download that version from? Sounds like a 10-year-old version, maybe it doesn't have that feature yet? – Amedee Van Gasse Jun 18 '20 at 12:54
This is used by the company I am working for. The thing is that they upgraded their project and I don't really know what to do now. – Apuna12 Jun 18 '20 at 13:49
@Apuna12 Yes, and where did your company get it from? There must be someone in your company who knows. – Amedee Van Gasse Jun 19 '20 at 10:59
I asked them, but got no answer... I think I found a workaround... See my answer to this question. – Apuna12 Jun 19 '20 at 12:36

score 2 · Answer 1 · answered Jun 18 '20 at 12:59

IIRC (and if I read the git history correctly), the namespace iTextSharp.text.pdf.parser was introduced in iTextSharp 5.0.2, i.e. it was never part of the LGPL licensed iTextSharp releases.

(The situation differs a bit for the iText/Java releases, here first proofs-of-concept were already present in the latest LGPL releases.)

Thus,

recently our project upgraded to a new iTextSharp.LGPLv2.Core v1.6.5

If you really upgraded, then your previous version appears to have been accompanied by a backported (from version 5.x) or cross-ported (from iText/Java before version 5) parser namespace. More likely, though, you actually downgraded from iTextSharp 5.x to a fork based on iTextSharp 4.2 or earlier, and in downgrading you usually lose features.

I assume you use iTextSharp.LGPLv2.Core to make use of the LGPL instead of the choice between AGPL and commercial license in iTextSharp 5, or you do it for Core support.

If it's really about the license, you can either try to port the iText/Java parser package from the last LGPL release (2.1.7) or tag (4.2.0), or re-implement text extraction completely independently.

If it's about Core support and you would be ready to buy a license or to be subject to the AGPL, you can also try and backport the latest iText 5.x parser namespace. This should be easier than crossporting from Java, and this text extraction code is far more advanced than the iText/Java code from before version 5.

answered Jun 18 '20 at 12:59

mkl


So from what you wrote here... Is my assumption correct that I am using an old library? – Apuna12 Jun 18 '20 at 13:50
Well, the description on NuGet is *"iTextSharp.LGPLv2.Core is an unofficial port of the last LGPL version of the iTextSharp (V4.1.6) to .NET Core"*. iTextSharp 4.1.6 was released in July 2009. So yes, the code base is very old. The only advantage compared to the official iTextSharp 5.x versions may be the focus on .NET Core, if you require that. – mkl Jun 18 '20 at 15:39
Then I wonder which exact version of iTextSharp you used before... – mkl Jun 19 '20 at 04:37
We used iTextSharp v5.0.7.0 – Apuna12 Jun 19 '20 at 09:42
You could also migrate to iText 7, which works on Core, but that may involve a bit more work. – Amedee Van Gasse Jun 19 '20 at 11:01
*"We used iTextSharp v5.0.7.0"* - oh, I wasn't aware iTextSharp 5 was Core compatible. Furthermore, I find neither a version 5.0.7 on the [official releases page](https://kb.itextpdf.com/home/it5kb/releases) nor a matching tag in the repositories, in both cases the version goes from 5.0.6 to 5.1.0... Is your 5.0.7 probably a version 5.0.6 someone else made Core compatible? – mkl Jun 19 '20 at 11:19
This is a good question, but I have no idea. I did find a solution and added it here, though; see my answer. – Apuna12 Jun 19 '20 at 12:42
@Apuna12 That workaround will only work with very special PDFs. – mkl Jun 19 '20 at 13:32
What do you mean by special? :) – Apuna12 Jun 19 '20 at 20:03
By special I mean: 1) the encoding of the fonts used on the page must be ASCII'ish (which isn't automatically so; more and more often you find fonts with some ad-hoc encoding); 2) the order of text drawing instructions must correspond to the reading order (not always the case, in particular with multi-column input); 3) gaps between words must be drawn as spaces, not created by cursor repositioning; 4) the text must be drawn directly in the page content and not in some other objects referred to from the page content. – mkl Jun 19 '20 at 21:09
If you only have to process PDFs from a single PDF producer (or a few similar ones), you may be lucky and get along with your code. Otherwise you will fairly soon have to deal with PDFs you cannot properly process. – mkl Jun 19 '20 at 21:25

score 2 · Accepted Answer · answered Jun 19 '20 at 12:38

After some digging I found a workaround for this issue. Since PdfTextExtractor is not available to me, I used the following code, which works nearly the same as the method mentioned in the question:

        // createSamplePdfFile() is not shown here; it supplies the PDF to read
        var pdfFile = createSamplePdfFile();
        var reader = new PdfReader(pdfFile);

        // Raw content stream of page 1 (page numbers are 1-based)
        var streamBytes = reader.GetPageContent(1);
        var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(streamBytes));

        // Keep only the string tokens, i.e. the operands of text-showing operators
        var stringsList = new List<string>();
        while (tokenizer.NextToken())
        {
            if (tokenizer.TokenType == PrTokeniser.TK_STRING)
            {
                stringsList.Add(tokenizer.StringValue);
            }
        }

        reader.Close();
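
The snippet above reads only page 1 and leaves the result in a list, while the question asks for the text of the whole document in a single string. A minimal sketch of how the same tokenizer loop might be wrapped for that purpose follows. The class name `NaivePdfText`, the method name `ExtractStrings` and the single-space separator are illustrative choices, not part of the library. The iTextSharp calls are exactly the ones already used in this thread, and the `iTextSharp.text.pdf` namespace is assumed to be where the package exposes them.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf; // assumed namespace of PdfReader and PrTokeniser

public static class NaivePdfText // hypothetical helper, not a library type
{
    // Collects the string tokens from every page's content stream and joins them.
    // This is not real text extraction: it ignores font encodings, reading order
    // and text drawn in objects referenced from the page, so it only suits simple PDFs.
    public static string ExtractStrings(string pdfPath)
    {
        if (!File.Exists(pdfPath))
        {
            return string.Empty;
        }

        var parts = new List<string>();
        var reader = new PdfReader(pdfPath);
        try
        {
            // PDF page numbers are 1-based
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                byte[] streamBytes = reader.GetPageContent(page);
                var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(streamBytes));

                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PrTokeniser.TK_STRING)
                    {
                        parts.Add(tokenizer.StringValue);
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }

        // The separator is a guess: a PDF may split one word across several strings
        return string.Join(" ", parts);
    }
}
```

It could then be called as `string pdfText = NaivePdfText.ExtractStrings(pdf1Path);`. If words come out split or glued together, the separator is the first thing to adjust, but no choice of separator will be right for every PDF.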

Hope this will help someone with a similar issue :)

Thanks all :)
