I am extracting text from many PDF files using pdfminer. For some of the PDF files the resulting text file is strange: each line consists of a single character. This happens only for some files, not all of them, and I still can't figure out why it happens or which PDF files will trigger it.

Here is my code:

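# Imports assume Python 3 with pdfminer.six
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters

    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # open() instead of the Python 2-only file()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text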

And this is one of the PDF files that gave this problem.
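To at least find out which files are affected, the extracted text can be checked with a simple heuristic. This is only a minimal sketch: the function name `looks_like_one_char_per_line` and the `threshold` and `min_lines` defaults are my own assumptions, not part of pdfminer, and it does not explain why the problem occurs.

```python
def looks_like_one_char_per_line(text, threshold=0.8, min_lines=20):
    """Return True if most non-empty lines contain a single character.

    threshold and min_lines are arbitrary illustrative defaults.
    """
    # Strip whitespace and drop empty lines before counting.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    if len(lines) < min_lines:
        # Too little text to judge reliably.
        return False
    single = sum(1 for line in lines if len(line) == 1)
    return single / float(len(lines)) >= threshold
```

Calling it on the return value of `convert_pdf_to_txt(path)` for each file would let the affected PDFs be logged and compared with each other.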

Edit

I also tried tika, but it gave a connection problem because I am using Django.

The error I am getting

Thank you very much

The Maestro

1 Answer

Use tika; it gives better results for me.

from tika import parser


def pdf_parser_tika(file_pointer):
    # from_file returns a dict; the extracted text is under "content"
    parsed = parser.from_file(file_pointer)
    return parsed["content"]

Rahul K P