I am using iTextSharp for PDF processing, and I need to extract all text from an existing PDF that is written in a certain font.

One way to do that is to inherit from RenderFilter and only allow text whose font has a certain PostscriptFontName. The problem is that when I do this, I see the following font names in the PDF:

CIDFont+F1
CIDFont+F2
CIDFont+F3
CIDFont+F4
CIDFont+F5

which is nothing like the actual font names I am looking for.

That is, I have not been able to see the actual font names anywhere in the document structure.

Yet, Adobe Acrobat DC does show the correct font names in the Format pane when I select various text boxes on the document canvas (e.g. Arial, Courier New, Roboto), so that information must be stored somewhere.

How do I get those real font names when parsing PDFs with iTextSharp?

GSerg
  • Can you share the PDF? I assume the names are only present in the actual embedded font program, not in the PDF metadata for it. iText (just like Adobe Acrobat, unless you do something that forces it to look into the actual font program) only looks at the font metadata and therefore only sees the anonymized names. BTW, strictly speaking this is an error in the PDF, but one that usually nobody cares about. – mkl Aug 11 '18 at 10:12
  • @mkl See https://drive.google.com/uc?export=download&id=1GmN2gPvnMoKRmudj9JAEbpeZSEUc782b – GSerg Aug 11 '18 at 12:22
  • Indeed, the embedded font programs (TTFs etc.) do contain the font names you are looking for. iText does not look into them for names, and strictly speaking it has no need to, because the PDF specification requires the BaseFont name you access to be taken from the font program. So your PDF is, strictly speaking, broken (albeit in a way hardly any software will ever complain about). Nonetheless, you can look into the font programs in your own code. – mkl Aug 15 '18 at 15:23

1 Answer

As determined in the course of the comments to the question, the font names are anonymized in all PDF metadata for the font but the embedded font program itself contains the actual font name.

(So, strictly speaking, the PDF is broken, albeit in a way hardly any software will ever complain about.)

If we want to retrieve those names, therefore, we have to look inside these font programs.

Here is a proof of concept following the architecture used in the answer you referenced, i.e. using a RenderFilter:

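using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Windows.Media;          // GlyphTypeface (WPF, PresentationCore)
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class FontProgramRenderFilter : RenderFilter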
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary fontDict = font.FontDictionary;
        PdfName subType = fontDict.GetAsName(PdfName.SUBTYPE);
        if (PdfName.TYPE0.Equals(subType))
        {
            PdfArray descendantFonts = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            PdfDictionary descendantFont = descendantFonts[0] as PdfDictionary;
            PdfDictionary fontDescriptor = descendantFont.GetAsDict(PdfName.FONTDESCRIPTOR);
            PdfStream fontStream = fontDescriptor.GetAsStream(PdfName.FONTFILE2);
            byte[] fontData = PdfReader.GetStreamBytes((PRStream)fontStream);
            MemoryStream dataStream = new MemoryStream(fontData);
            dataStream.Position = 0;
            MemoryPackage memoryPackage = new MemoryPackage();
            Uri uri = memoryPackage.CreatePart(dataStream);
            GlyphTypeface glyphTypeface = new GlyphTypeface(uri);
            memoryPackage.DeletePart(uri);
            ICollection<string> names = glyphTypeface.FamilyNames.Values;
            return names.Any(name => name.Contains("Arial"));
        }
        else
        {
            // analogous code for other font subtypes
            return false;
        }
    }
}

The MemoryPackage class is from this answer which was my first find searching for how to read information from a font in memory using .Net.
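If depending on WPF's GlyphTypeface is not an option, the names can also be read directly from the font bytes. The following is only a minimal sketch under stated assumptions, not a full font parser: it assumes the font program is sfnt-wrapped data (TrueType or OpenType, as typically found under FONTFILE2) and that the embedded program still carries a `name` table, which subset fonts in a PDF may lack. The names `SfntNames` and `ReadNames` are hypothetical helpers introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SfntNames
{
    // Big-endian readers for the sfnt container.
    static int U16(byte[] d, int o) => (d[o] << 8) | d[o + 1];
    static long U32(byte[] d, int o) => ((long)U16(d, o) << 16) | (uint)U16(d, o + 2);

    // Returns every string stored under the given nameID
    // (1 = family, 4 = full name, 6 = PostScript name).
    // Returns an empty list if there is no readable 'name' table.
    public static List<string> ReadNames(byte[] font, int nameId)
    {
        var result = new List<string>();
        if (font == null || font.Length < 12) return result;

        int numTables = U16(font, 4);
        for (int i = 0; i < numTables; i++)
        {
            int rec = 12 + 16 * i;                     // table directory record
            if (rec + 16 > font.Length) return result;
            if (Encoding.ASCII.GetString(font, rec, 4) != "name") continue;

            long table = U32(font, rec + 8);           // offset of the 'name' table
            if (table + 6 > font.Length) return result;
            int count = U16(font, (int)table + 2);
            long strings = table + U16(font, (int)table + 4);

            for (int n = 0; n < count; n++)
            {
                int r = (int)table + 6 + 12 * n;       // name record
                if (r + 12 > font.Length) break;
                if (U16(font, r + 6) != nameId) continue;

                int platform = U16(font, r);
                int length = U16(font, r + 8);
                long start = strings + U16(font, r + 10);
                if (start + length > font.Length) continue;

                // Unicode (0) and Windows (3) platform strings are UTF-16BE.
                // Other platforms are decoded as Latin-1 here, a simplification.
                Encoding enc = (platform == 0 || platform == 3)
                    ? Encoding.BigEndianUnicode
                    : Encoding.GetEncoding("ISO-8859-1");
                result.Add(enc.GetString(font, (int)start, length));
            }
            break;
        }
        return result;
    }
}
```

With the `fontData` array obtained in the filter, a check such as `SfntNames.ReadNames(fontData, 1).Any(n => n.Contains("Arial"))` could stand in for the GlyphTypeface lookup. Bare CFF or Type 1 font programs are not sfnt containers and would need a different reader.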

Applying the FontProgramRenderFilter to your PDF file like this:

using (PdfReader pdfReader = new PdfReader(SOURCE))
{
    FontProgramRenderFilter fontFilter = new FontProgramRenderFilter();
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), fontFilter);
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy));
}

the result is

This is Arial.

Beware: This is a mere proof of concept.

On one hand you will surely also need to implement the part commented as analogous code for other font subtypes above; and even the TYPE0 part is not ready for production use as it only considers FONTFILE2 and does not handle null values gracefully.
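As a rough illustration of what a more defensive lookup might look like, the sketch below locates the embedded font program for both composite (TYPE0) and simple fonts, and returns null instead of throwing when an entry is missing. It relies on the font program being stored under FONTFILE, FONTFILE2 or FONTFILE3 in the font descriptor. `FontProgramLocator` and `FindFontProgram` are hypothetical names, and the code has not been tested against a wide range of PDFs.

```csharp
using iTextSharp.text.pdf;

static class FontProgramLocator
{
    // Keys under which a font descriptor may hold an embedded font program.
    static readonly PdfName[] FontFileKeys =
        { PdfName.FONTFILE, PdfName.FONTFILE2, PdfName.FONTFILE3 };

    // Returns the embedded font program stream, or null if none is found.
    public static PdfStream FindFontProgram(PdfDictionary fontDict)
    {
        if (fontDict == null) return null;

        PdfDictionary holder = fontDict;
        if (PdfName.TYPE0.Equals(fontDict.GetAsName(PdfName.SUBTYPE)))
        {
            // Composite font: the descriptor sits in the descendant CIDFont.
            PdfArray descendants = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            holder = (descendants != null && descendants.Size > 0)
                ? descendants.GetAsDict(0)
                : null;
        }
        if (holder == null) return null;

        // Simple fonts carry the descriptor in the font dictionary itself.
        PdfDictionary descriptor = holder.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null) return null;   // e.g. no embedded program at all

        foreach (PdfName key in FontFileKeys)
        {
            PdfStream stream = descriptor.GetAsStream(key);
            if (stream != null) return stream;
        }
        return null;
    }
}
```

For a document loaded through PdfReader, the returned stream can be cast to PRStream and passed to `PdfReader.GetStreamBytes`, as in the filter above. Which parser can then read the bytes still depends on the font program format.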

On the other hand you will want to cache names for fonts already inspected.
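One possible shape for such a cache is a small memoizing wrapper keyed by the font object. This is a sketch only: `PerFontCache` is a hypothetical helper, and it assumes that the same font instance is handed to the filter for every text chunk using that font, which should be verified for the iTextSharp version in use.

```csharp
using System;
using System.Collections.Generic;

// Remembers the result of an expensive per-font inspection so that
// each font object is inspected at most once.
class PerFontCache<TFont, TResult> where TFont : class
{
    private readonly Dictionary<TFont, TResult> results = new Dictionary<TFont, TResult>();
    private readonly Func<TFont, TResult> inspect;

    public PerFontCache(Func<TFont, TResult> inspect)
    {
        if (inspect == null) throw new ArgumentNullException("inspect");
        this.inspect = inspect;
    }

    public TResult Get(TFont font)
    {
        TResult value;
        if (!results.TryGetValue(font, out value))
        {
            value = inspect(font);      // runs once per distinct font object
            results[font] = value;
        }
        return value;
    }
}
```

In the filter, AllowText could then delegate to something like `cache.Get(renderInfo.GetFont())`, with the font program inspection moved into the delegate passed to the constructor.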

mkl
  • Thank you very much, that is a very useful example. I'm not an expert in how PDF stores fonts in binary streams, and from your comment I assume the code to extract `FamilyNames` or the like would be completely different between `Type 0`, `Type 2`, `Type 3`, `TrueType` and `OpenType`? – GSerg Aug 21 '18 at 08:52
  • *"would be completely different between ..."* - at least located differently. For Type 0 with TTF see my code above; for Type 0 with CFF you'd have to inspect **FontFile3** instead of **FontFile2**. For Type 1 or TrueType simple fonts, there is no **DescendantFonts** array; the **FontDescriptor** entry is in the font dictionary itself, and Type 1 fonts use the **FontFile** or **FontFile3** key. For Type 3 fonts there obviously is no font program to inspect. And I have not checked whether `GlyphTypeface` knows how to handle Adobe Type 1 fonts and CFF fonts. – mkl Aug 21 '18 at 09:26