As determined in the comments on the question, the font names are anonymized in all PDF metadata for the font, but the embedded font program itself contains the actual font name.
(So, strictly speaking, the PDF is broken, albeit in a way hardly any software will ever complain about.)
If we want to retrieve those names, therefore, we have to look inside these font programs.
Here is a proof of concept following the architecture used in the answer you referenced, i.e. using a `RenderFilter`:
```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Windows.Media;          // GlyphTypeface (WPF)
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class FontProgramRenderFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary fontDict = font.FontDictionary;
        PdfName subType = fontDict.GetAsName(PdfName.SUBTYPE);
        if (PdfName.TYPE0.Equals(subType))
        {
            // A composite font keeps its descriptor in the descendant font.
            PdfArray descendantFonts = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            // GetAsDict also resolves an indirect reference in the array.
            PdfDictionary descendantFont = descendantFonts.GetAsDict(0);
            PdfDictionary fontDescriptor = descendantFont.GetAsDict(PdfName.FONTDESCRIPTOR);
            // Only embedded TrueType font programs are considered here.
            PdfStream fontStream = fontDescriptor.GetAsStream(PdfName.FONTFILE2);
            byte[] fontData = PdfReader.GetStreamBytes((PRStream)fontStream);

            // GlyphTypeface needs a URI, so expose the bytes via an in-memory package part.
            MemoryStream dataStream = new MemoryStream(fontData);
            dataStream.Position = 0;
            MemoryPackage memoryPackage = new MemoryPackage();
            Uri uri = memoryPackage.CreatePart(dataStream);
            GlyphTypeface glyphTypeface = new GlyphTypeface(uri);
            memoryPackage.DeletePart(uri);

            // Accept the text chunk if any family name of the font program mentions "Arial".
            ICollection<string> names = glyphTypeface.FamilyNames.Values;
            return names.Any(name => name.Contains("Arial"));
        }
        else
        {
            // analogous code for other font subtypes
            return false;
        }
    }
}
```
The `MemoryPackage` class is from this answer, which was my first find when searching for how to read information from a font in memory using .NET.
Applied to your PDF file like this:

```csharp
using (PdfReader pdfReader = new PdfReader(SOURCE))
{
    FontProgramRenderFilter fontFilter = new FontProgramRenderFilter();
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), fontFilter);
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy));
}
```

the result is

```
This is Arial.
```
Beware: This is a mere proof of concept.
On the one hand, you will surely also need to implement the part commented as `analogous code for other font subtypes` above; and even the `TYPE0` part is not ready for production use, as it only considers `FONTFILE2` and does not handle `null` values gracefully.
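As a rough sketch of what more defensive handling might look like, the lookup of the font program stream could be moved into a small helper that tolerates missing entries and also checks the other font file keys. The class name `FontProgramLocator` and the method `FindFontProgram` are hypothetical names chosen for illustration; only the `PdfName` constants and `GetAsStream` come from iTextSharp.

```csharp
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class FontProgramLocator
{
    // Returns the first embedded font program stream found in the
    // font descriptor, or null if the descriptor or all entries are missing.
    public static PdfStream FindFontProgram(PdfDictionary fontDescriptor)
    {
        if (fontDescriptor == null)
            return null;

        PdfName[] keys = { PdfName.FONTFILE, PdfName.FONTFILE2, PdfName.FONTFILE3 };
        foreach (PdfName key in keys)
        {
            // GetAsStream yields null if the entry is absent or not a stream.
            PdfStream stream = fontDescriptor.GetAsStream(key);
            if (stream != null)
                return stream;
        }
        return null;
    }
}
```

The caller would then have to return `false` (or fall back to the PDF metadata) when the result is `null`. Note that finding a stream does not guarantee that `GlyphTypeface` can parse it: `FONTFILE` holds Type 1 programs and `FONTFILE3` may hold bare CFF data, so the name extraction for those formats may need a different parser.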
On the other hand, you will want to cache names for fonts already inspected.
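One possible shape for such a cache is a small memoizing wrapper around the expensive inspection, so each font program is parsed only once. This is a minimal sketch: `FontMatchCache` is a hypothetical class, not part of iTextSharp, and the choice of key is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the outcome of an expensive per-font check.
class FontMatchCache<TKey>
{
    private readonly Dictionary<TKey, bool> results = new Dictionary<TKey, bool>();
    private readonly Func<TKey, bool> inspect;

    public FontMatchCache(Func<TKey, bool> inspect)
    {
        this.inspect = inspect;
    }

    // Runs the inspection on first sight of a key and reuses the result afterwards.
    public bool Matches(TKey fontKey)
    {
        bool match;
        if (!results.TryGetValue(fontKey, out match))
        {
            match = inspect(fontKey);   // expensive: parses the font program
            results[fontKey] = match;
        }
        return match;
    }
}
```

In the filter, an instance could be held as a field and consulted from `AllowText`, with the font program inspection passed in as the delegate. Which key identifies a font reliably (for example the `DocumentFont` instance or its font dictionary) is an assumption you would need to verify for your documents.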