Python pdfminer LAParams mixes text output

Question

i have a pdf file and i wanna parse text from it with pdfminer.The problem is LAParams sometimes fails and give some portion of the line at the end.I can't figure out why. My pdf looks like this: Out put looks like this: My code is here,thanks in advance:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec , laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos=set()

    for PageNumer,page in enumerate(PDFPage.get_pages(fp, pagenos , password=password,caching=caching, check_extractable=True)):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('C:\\Users\\Vagos\\Desktop\\europe.pdf'))

score 10 · Accepted Answer · edited Dec 30 '22 at 18:34

Found the answer myself.

Issue

The layout-analysis parameters LAParams() (docs for pdfminer.six) default to word_margin of 0.1:

class pdfminer.layout.LAParams(line_overlap: float = 0.5, char_margin: float = 2.0, line_margin: float = 0.5, word_margin: float = 0.1, boxes_flow: Optional[float] = 0.5, detect_vertical: bool = False, all_texts: bool = False)

For inactive pdfminer see source code of LAParams().

My document apparently sometimes had greater word-margins which caused the problems.

Solution

Using LAParams(char_margin = 20) which initiates the char_margin with 20 solved the issue.

Python pdfminer LAParams mixes text output

1 Answers1

Issue

Solution