I am using pdfminer.six to extract text in a specific font, but I have run into the following problem:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def extract_text_by_font(pdf_file):
    extracted_text = ""
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            extracted_text += character.get_text()
    return extracted_text
If I compare the output of this function with that of pdfminer.high_level.extract_text, extract_text_by_font does not extract the text properly. For example, with pdfminer.high_level.extract_text I get

"... Hello World..."

but with extract_text_by_font I get

"...HelloWorld...".

So it sometimes drops the whitespace. Can you fix it?
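
One likely cause, offered as a sketch rather than a confirmed fix: pdfminer.six represents the spaces and line breaks it infers during layout analysis as LTAnno objects, not LTChar, so a loop that keeps only LTChar silently drops them. The example below keeps LTAnno items as well. The names iter_line_items and join_items, the font_name parameter, and the substring match on the font name are all assumptions made for illustration; they are not part of pdfminer.six.

```python
def join_items(items, font_name=None):
    """Join (text, fontname) pairs into one string.

    fontname is None for layout-inserted whitespace (LTAnno).
    font_name is a hypothetical filter: a character is kept when
    font_name is a substring of its font name (assumption).
    """
    parts = []
    # Without a filter, every piece of whitespace is kept.
    keep_whitespace = font_name is None
    for text, fontname in items:
        if fontname is None:
            # Inferred whitespace: keep it only if the preceding
            # character was kept, so filtered-out runs leave no gaps.
            if keep_whitespace:
                parts.append(text)
        elif font_name is None or font_name in fontname:
            parts.append(text)
            keep_whitespace = True
        else:
            keep_whitespace = False
    return "".join(parts)


def iter_line_items(pdf_file):
    """Yield (text, fontname) pairs for every item in every text line."""
    # Imported lazily so join_items can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for text_line in element:
                # Skip anything that is not a line of text.
                if not isinstance(text_line, LTTextLine):
                    continue
                for item in text_line:
                    if isinstance(item, LTChar):
                        yield item.get_text(), item.fontname
                    elif isinstance(item, LTAnno):
                        # Spaces and newlines added by layout analysis.
                        yield item.get_text(), None
```

With a placeholder path, join_items(iter_line_items("example.pdf")) should then keep the space in "Hello World", and passing font_name="Bold" (a made-up value) would restrict the output to fonts whose name contains that string. The result may still differ from extract_text in how blocks and pages are separated.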