I want to extract all the distinct font names used by the text in a PDF file. I am using the iTextSharp DLL, and my code is given below.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace GetFontName
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf");
            HashSet<String> names = new HashSet<string>();
            PdfDictionary resources;
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary dic = reader.GetPageN(p);
                resources = dic.GetAsDict(PdfName.RESOURCES);
                if (resources != null)
                {
                    //gets fonts dictionary
                    PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                    if (fonts != null)
                    {

                        PdfDictionary font;

                        foreach (PdfName key in fonts.Keys)
                        {
                            font = fonts.GetAsDict(key);
                            string name = font.GetAsName(iTextSharp.text.pdf.PdfName.BASEFONT).ToString();

                            //check for prefix subsetted font

                            if (name.Length > 8 && name.ToCharArray()[7] == '+')
                            {
                                name = String.Format("%s subset (%s)", name.Substring(8), name.Substring(1, 7));

                            }
                            else
                            {
                                //get type of fully embedded fonts
                                name = name.Substring(1);
                                PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                                if (desc == null)
                                    name += "no font descriptor";
                                else if (desc.Get(PdfName.FONTFILE) != null)
                                    name += "(Type1) embedded";
                                else if (desc.Get(PdfName.FONTFILE2) != null)
                                    name += "(TrueType) embedded ";
                                else if (desc.Get(PdfName.FONTFILE3) != null)
                                    name += name;//("+font.GetASName(PdfName.SUBTYPE).ToString().SubSTring(1)+")embedded';
                            }

                            names.Add(name);
                        }
                    }
                }
            }
            var collections = from name in names
                              select name;
            foreach (string fname in collections)
            {
                Console.WriteLine(fname);
            }
            Console.Read();

        }
    }
}

The output I am getting is "Glyphless Font no font descriptor" for every PDF file I use as input. The input file is available at the following link:

https://drive.google.com/open?id=0B6tD8gqVZtLiM3NYMmVVVllNcWc

halfer
Rahul Agnihotri
  • PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf"); - Double-check the path of the file; that might be the problem, since the code seems OK. Also, I would highly recommend adding some debugging when trying scripts copy-pasted from the internet, to see that they actually work. – Daniel Netzer Jun 14 '16 at 14:08

2 Answers

I opened your PDF in Adobe Acrobat and looked at the font panel. This is what I saw:

[Screenshot: the font panel in Adobe Acrobat]

You have an embedded subset of LiberationMono, which means that the name of the font will be stored in the file as ABCDEF+LiberationMono (where ABCDEF is a series of 6 random, but unique characters) because the font is subsetted. See What are the extra characters in the font name of my PDF?
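
To make the naming convention concrete, here is a minimal sketch in plain C# (no iTextSharp needed) that splits such a name into its tag and the real font name. The class `SubsetTag` and the method `TrySplit` are hypothetical names made up for this illustration. It assumes the name has already had its leading `/` removed, and it relies on the PDF specification's description of the tag as six uppercase letters followed by a plus sign.

```csharp
using System;

static class SubsetTag
{
    // Splits "ABCDEF+FontName" into the six-letter tag and the real font name.
    // Returns false (and leaves baseName equal to the input) when there is no tag.
    public static bool TrySplit(string fontName, out string tag, out string baseName)
    {
        tag = null;
        baseName = fontName;
        // Need six letters, a '+', and at least one character of font name.
        if (fontName == null || fontName.Length < 8 || fontName[6] != '+')
            return false;
        for (int i = 0; i < 6; i++)
        {
            if (fontName[i] < 'A' || fontName[i] > 'Z')
                return false;
        }
        tag = fontName.Substring(0, 6);
        baseName = fontName.Substring(7);
        return true;
    }
}

class Program
{
    static void Main()
    {
        string tag, baseName;
        if (SubsetTag.TrySplit("BAAAAA+LiberationMono", out tag, out baseName))
            Console.WriteLine("{0} subset ({1})", baseName, tag);
        // prints: LiberationMono subset (BAAAAA)
    }
}
```

Checking that the first six characters are uppercase letters, rather than only looking for a `+` at a fixed position, avoids misreading an ordinary font name that happens to contain a plus sign.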

Now let's take a look at the same file opened in iText RUPS:

[Screenshot: the same file opened in iText RUPS]

We find the /Font object and it has a /FontDescriptor. In the /FontDescriptor, we find the /FontName in the format we expected: BAAAAA+LiberationMono.

Now that you know where to look for that name, you can adapt your code.
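
As a starting point, the lookup could be sketched like this. This is only a sketch, not a drop-in fix: `FontNames` and `Describe` are hypothetical names, and it assumes the iTextSharp 5.x API already used in the question (`GetAsDict`, `GetAsName`, and the `PdfName` constants). It reads /FontName from the /FontDescriptor and falls back to /BaseFont when there is no descriptor.

```csharp
using System;
using iTextSharp.text.pdf;

static class FontNames
{
    // Returns /FontName from the font's /FontDescriptor when present,
    // otherwise /BaseFont. The leading '/' of the PDF name is removed.
    public static string Describe(PdfDictionary font)
    {
        PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        PdfName name = desc != null ? desc.GetAsName(PdfName.FONTNAME) : null;
        if (name == null)
            name = font.GetAsName(PdfName.BASEFONT);
        // Some fonts carry neither entry, so guard against null.
        return name == null ? "(unnamed font)" : name.ToString().Substring(1);
    }
}
```

Inside the existing `foreach` loop, you would call it as `names.Add(FontNames.Describe(fonts.GetAsDict(key)));`. The null checks matter because calling `ToString()` on a missing /BaseFont entry would throw a `NullReferenceException`.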

Bruno Lowagie
  • Thanks for the clarification. Would you mind helping me with the code? I am just a newbie to coding and C#. – Rahul Agnihotri Jun 14 '16 at 14:19
  • @Rahul, don't give up at this early juncture! Once you have hints as helpful as this, do please try applying it - it is very good practice. – halfer Jun 14 '16 at 15:13
  • Not sure if i am right track......as got the hint: font.GetAsDict(PdfName.FontDescriptor.FontName); if (desc == null) name += "no font descriptor"; else if (desc.Get(PdfName.FontName) != null) name += "(Type1) embedded"; else if (desc.Get(PdfName.FontName) != null) name += "(TrueType) embedded "; else if (desc.Get(PdfName.FontName) != null) – Rahul Agnihotri Jun 14 '16 at 18:20
  • Please let me know – Rahul Agnihotri Jun 14 '16 at 18:20

Running your code with minimal changes I get as output

%s subset (%s)

Actually %s looks like a Java format string, not a .NET format string. Using the more .NET-like format string {0} subset ({1}) I get

LiberationMono subset (BAAAAA+)
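
The difference can be reproduced without any PDF at all. The following sketch uses only `System`; `FormatDemo` and its `Describe` method are hypothetical names for this illustration, and the input string is the /BaseFont value as `ToString()` would return it, including the leading slash.

```csharp
using System;

static class FormatDemo
{
    // Same substring arithmetic as in the question, but with .NET placeholders.
    public static string Describe(string pdfName)
    {
        return String.Format("{0} subset ({1})", pdfName.Substring(8), pdfName.Substring(1, 7));
    }
}

class Program
{
    static void Main()
    {
        // "%s" is not a placeholder in .NET, so the arguments are ignored.
        Console.WriteLine(String.Format("%s subset (%s)", "a", "b"));
        // prints: %s subset (%s)

        Console.WriteLine(FormatDemo.Describe("/BAAAAA+LiberationMono"));
        // prints: LiberationMono subset (BAAAAA+)
    }
}
```

Note that `Substring(1, 7)` takes seven characters and therefore includes the `+`; `Substring(1, 6)` would give just the tag.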

I would suggest using backslashes and the @"..." string form instead of slashes in a file path, e.g. like this

PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");

and double-check the file name and path; after all, the file you provided is named Hello_World.pdf.

mkl
  • Thanks everyone for the suggestions and help. I have been able to resolve the issue without any changes to the code. The only thing required was to use the iTextSharp 5.5.9 DLL; everything else was fine. This can be marked as closed. – Rahul Agnihotri Jun 15 '16 at 20:15
  • @RahulAgnihotri *The only thing required was to use the iTextSharp 5.5.9 DLL* - Hmm, as you did not mention the version you used, you gave the impression you had been using the now-current version all along... *This can be marked as closed* - You can do that yourself: create an answer containing the reason (something along the lines of "used old iTextSharp version, works fine with current 5.5.9") and mark that answer as accepted (click on the tick at its upper left). Marking your own answer as accepted may not be possible immediately, but after a few hours it surely is. – mkl Jun 16 '16 at 08:12