I am working on a project in which I have to find the font size of every paragraph in a PDF file. I have tried various Python libraries such as fitz, PyPDF2, pdfrw, pdfminer, and pdfreader. All of them extract the text, but I don't know how to get the font size of the paragraphs. Thanks in advance; your help is appreciated.

I tried the following, but could not get the font size.

import fitz  # PyMuPDF

filepath = '/home/user/Downloads/abc.pdf'
text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # get_text() with no arguments returns plain text only, without font information
        text += page.get_text()
print(text)
V J
  • @K J Yes, a paragraph is a block of text and may contain different font heights as well. But is there any way to get those font heights? – V J Jun 23 '21 at 12:16
  • Does it have to run locally or can you use a cloud service that has a Python library? – joelgeraci Jun 23 '21 at 17:45
  • @joelgeraci Yes, I am using a Python library and want to run it locally. How can I extract the font size from the text of a PDF file? – V J Jun 24 '21 at 05:30
  • Ok - I can't help you if you need it to run locally. Adobe has a SaaS Extract API that will extract text as paragraphs and gives you detailed font information for each including styling within the paragraph. It has a Python SDK but is cloud-based. – joelgeraci Jun 24 '21 at 15:01
  • @joelgeraci Thank you for your valuable time and suggestions, but I found the solution. – V J Jun 25 '21 at 05:18

2 Answers


I found a solution using pdfminer. The Python code is given below.

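from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

path = r'/path/to/pdf'

# Each entry is [font size, text of the text box]
Extract_Data = []

for page_layout in extract_pages(path):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # Reset per text box so a size never leaks over from the previous one;
            # stays None if the box contains no characters
            Font_size = None
            for text_line in element:
                # Skip anything that is not itself a container of characters
                if not isinstance(text_line, LTTextContainer):
                    continue
                for character in text_line:
                    if isinstance(character, LTChar):
                        # Keeps the size of the last character in the box
                        Font_size = character.size
            Extract_Data.append([Font_size, element.get_text()])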
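
Because a paragraph can mix several font sizes (as noted in the comments on the question), one option is to report the most common character size per text box rather than the size of the last character seen. The following is only a minimal sketch under that assumption: `dominant_font_size` and `paragraph_font_sizes` are hypothetical helper names, not part of pdfminer.six, and rounding to one decimal place is an arbitrary choice.

```python
from collections import Counter


def dominant_font_size(sizes, ndigits=1):
    """Return the most common size after rounding, or None for an empty input."""
    rounded = [round(size, ndigits) for size in sizes]
    if not rounded:
        return None
    return Counter(rounded).most_common(1)[0][0]


def paragraph_font_sizes(path):
    """Return a list of (dominant size, text) tuples, one per text box in the PDF."""
    # Imported here so that dominant_font_size can be used without pdfminer.six
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar

    results = []
    for page_layout in extract_pages(path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            sizes = []
            for text_line in element:
                # Skip anything that is not itself a container of characters
                if not isinstance(text_line, LTTextContainer):
                    continue
                for character in text_line:
                    if isinstance(character, LTChar):
                        sizes.append(character.size)
            results.append((dominant_font_size(sizes), element.get_text()))
    return results
```

Calling `paragraph_font_sizes(path)` with the same `path` as above would give one entry per text box; whether a text box corresponds to a paragraph depends on pdfminer's layout analysis.
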
V J
  • Many thanks. Just regarding the package installation, please note you should install pdfminer.six (pip3 install pdfminer.six ), from https://stackoverflow.com/questions/64948893/pdfminer-high-level-not-showing-up – Behrouz Beheshti Nov 16 '22 at 12:51

A better way to do this would be to use fitz (PyMuPDF) itself. It is significantly faster and cleaner at extracting font information than pdfminer. An example code snippet is shown below.

import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    keyword = keyword.lower()  # span text is lowercased below, so lowercase the keyword too
    with fitz.open(filePath) as pdf:  # filePath is a string that contains the path to the pdf
        for page in pdf:
            page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
            for block in page_dict["blocks"]:
                # image blocks have no "lines" key, so they are skipped
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        # only store font information of spans containing the keyword
                        if keyword in span["text"].lower():
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    return results

If you want the font information of every span of text rather than only those containing a keyword, omit the if condition that checks for the keyword.

You can extract the text in any desired format once you understand the structure of the dictionary returned by get_text("dict"), as described in the PyMuPDF documentation.
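
To make that structure concrete, here is a small sketch that summarises each text block of one page dictionary, which is roughly what the original question asks for if a block is treated as a paragraph. It relies only on the keys already used above ("blocks", "lines", "spans", "text", "size", "font"); `block_font_info` is a hypothetical helper name, and treating a block as a paragraph is an assumption that may not hold for every PDF.

```python
def block_font_info(page_dict):
    """Summarise every text block as a dict with its text, font sizes and font names."""
    summaries = []
    for block in page_dict.get("blocks", []):
        lines = block.get("lines", [])
        if not lines:
            continue  # image blocks carry no "lines" key
        spans = [span for line in lines for span in line["spans"]]
        # Join spans within a line directly, and lines with a newline
        text = "\n".join(
            "".join(span["text"] for span in line["spans"]) for line in lines
        )
        summaries.append({
            "text": text,
            # Unique sizes, rounded to one decimal (an arbitrary choice)
            "sizes": sorted({round(span["size"], 1) for span in spans}),
            "fonts": sorted({span["font"] for span in spans}),
        })
    return summaries
```

With PyMuPDF this could be called as `block_font_info(page.get_text("dict"))` for each page of an opened document. A block that reports more than one size contains mixed formatting, for example a heading and body text that were grouped together.
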