Extract text from PDF by format

Question

I am trying to extract the headlines from PDFs. So far I have tried reading the plain text and taking the first line (which didn't work because the headlines are not at the beginning of the plain text) and reading the text from a fixed region (which didn't work because the regions are not always the same).

In my opinion, the easiest way would be to read only text with a specific format (font, font size, etc.). Is there a way to do this?

You are not telling us how you are trying to extract the text from a PDF. If you are using iTextSharp, this question is a possible duplicate of [Can we use text extraction strategy after applying location extraction strategy in itextpdf?](http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy). In this question, somebody is adding an extra extraction strategy that checks for the font of the text that is extracted in order to filter specific text from a PDF. — Bruno Lowagie, Feb 01 '15 at 11:50
*Is there a way to do this?* - you should mention your PDF library of choice. Any PDF library allowing text extraction should also allow text extraction by font or size. — mkl, Feb 02 '15 at 05:09
I didn't say how I am doing it because it isn't really important for the solution. In this case I don't mind changing the library. (I have used PdfSharp and PDFBox. Neither seems to offer this possibility...) — derBasti, Feb 02 '15 at 08:44

Bobrovsky · Accepted Answer · 2020-08-07T13:13:19.703

You can enumerate all text objects on a PDF page using the Docotic.Pdf library. For each text object, information about its font and size is available. Below is a sample:

using System;
using BitMiracle.Docotic.Pdf;

// Prints the text, font name, height, and position of every text object in the PDF.
public static void listTextObjects(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        string format = "{0}\n{1}, {2}px at {3}";

        foreach (PdfPage page in pdf.Pages)
        {
            foreach (PdfPageObject obj in page.GetObjects())
            {
                // Skip page objects that are not text.
                if (obj.Type != PdfPageObjectType.Text)
                    continue;

                PdfTextData text = (PdfTextData)obj;

                string message = string.Format(format, text.Text, text.Font.Name,
                    text.Size.Height, text.Position);
                Console.WriteLine(message);
            }
        }
    }
}

The code will output lines like the following for each text object on each page of the input PDF file.

FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }

You can use the retrieved information to find the largest text, bold text, or text with whatever other properties are used to format the headline.
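As a rough sketch of that selection step, assume the text, font name, and height of each text object have already been copied into a plain record. The `TextRun` record and the `HeadlineFinder` class below are hypothetical names used only for illustration; they are not part of Docotic.Pdf. The heuristic (largest height wins, bold font breaks ties) is likewise an assumption about how headlines are formatted, and the sample uses C# 9 records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for the values printed by the sample above;
// this is not a Docotic.Pdf type.
public record TextRun(string Text, string FontName, double Height);

public static class HeadlineFinder
{
    // Returns the non-empty run with the largest height, preferring
    // fonts whose name contains "Bold" when heights are equal.
    // Returns null when there is no non-empty run.
    public static TextRun? FindHeadline(IEnumerable<TextRun> runs)
    {
        return runs
            .Where(r => !string.IsNullOrWhiteSpace(r.Text))
            .OrderByDescending(r => r.Height)
            .ThenByDescending(r => r.FontName.Contains("Bold", StringComparison.OrdinalIgnoreCase))
            .FirstOrDefault();
    }
}

public static class Program
{
    public static void Main()
    {
        var runs = new List<TextRun>
        {
            // Made-up body text, for illustration only.
            new TextRun("Some body text", "Helvetica", 9.0),
            // Values taken from the sample output above.
            new TextRun("FACTUUR", "Helvetica-BoldOblique", 19.04),
        };

        TextRun? headline = HeadlineFinder.FindHeadline(runs);
        Console.WriteLine(headline?.Text); // prints "FACTUUR"
    }
}
```

Keeping the selection logic separate from the PDF library means the heuristic can be tuned (for example, filtering by an exact font name) without touching the extraction code.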

If your PDF is guaranteed to have the headline as the topmost text on a page, then you can use an even simpler approach:

using System;
using BitMiracle.Docotic.Pdf;

// Prints the text of every page of the PDF.
public static void printText(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            string text = page.GetTextWithFormatting();
            Console.WriteLine(text);
        }
    }
}

The GetTextWithFormatting method returns text in reading order (i.e., from the top-left to the bottom-right position), so a topmost headline comes first in the returned text.

Disclaimer: I am one of the developers of the library.
