PDF to TEXT converted in a wrong way

Question

I am extracting text from many PDF files using pdfminer. For some of the PDF files the resulting text file is strange: each line consists of only one character. This happens with only some of the files, and I still can't work out why, or which PDF files will cause the problem.

Here is my code:

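# Imports assume Python 3 with pdfminer.six; on Python 2 use
# "from cStringIO import StringIO" instead.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters

    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # open() replaces the Python 2-only file() builtin
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text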

And this is one of the PDF files that caused this problem.

Edit

I tried tika, but it gave me a connection problem because I am using it inside Django.

Thank you very much

score 0 · Answer 1 · answered Jun 23 '16 at 13:32

Use tika; it gives better results for me.

from tika import parser


def pdf_parser_tika(file_pointer):
    # from_file returns a dict; the extracted text is stored under "content"
    parsed = parser.from_file(file_pointer)
    return parsed["content"]
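
The question also asks which PDF files trigger the problem. The following is only a rough sketch for flagging affected output after extraction, not a fix for the underlying cause. The function name looks_one_char_per_line and the threshold and min_lines defaults are hypothetical choices made for illustration; they do not come from pdfminer or tika.

```python
def looks_one_char_per_line(text, threshold=0.8, min_lines=20):
    """Return True when most non-empty lines hold exactly one character.

    threshold and min_lines are illustrative defaults (assumptions),
    not values taken from pdfminer or tika.
    """
    # Strip whitespace and drop the lines that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]

    # Very short outputs are too small to judge reliably.
    if len(lines) < min_lines:
        return False

    single = sum(1 for line in lines if len(line) == 1)
    return single / float(len(lines)) >= threshold
```

A check like this could be run on the text returned by an extractor, so that only the flagged files are retried with a different tool or inspected by hand.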

Rahul K P

It gave a problem because I am using it within a website written with Django. – The Maestro Jun 23 '16 at 14:06
