I am extracting the text from many PDF files using pdfminer. For some of the PDF files the resulting text file is strange: each line consists of only one character. This happens with only some of the files, and I still can't work out why, or which PDF files will cause the problem.
Here is my code:
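try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # open() works on Python 2 and 3; file() exists only on Python 2
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    # Read the buffer before closing it
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text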
And this is one of the PDF files that gave this problem.
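To narrow down which files are affected, a check along these lines could flag extracted text that shows the symptom. This is only a rough sketch: the function name `looks_char_per_line` and the `threshold` and `min_lines` defaults are my own assumptions, not part of pdfminer, and the cutoff may need tuning.

```python
def looks_char_per_line(text, threshold=0.8, min_lines=5):
    """Heuristic: True if most non-empty lines hold exactly one character.

    threshold and min_lines are arbitrary assumptions, not pdfminer settings.
    """
    # Ignore blank lines and surrounding whitespace
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        # Too little text to judge either way
        return False
    single = sum(1 for ln in lines if len(ln) == 1)
    # float() keeps the division correct on Python 2 as well
    return single / float(len(lines)) >= threshold
```

Running it on the string returned by `convert_pdf_to_txt(path)` for every file would at least give a list of the problematic PDFs to compare against the ones that extract correctly.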
Edit
I also tried Tika, but it gave a connection problem because I am using Django.
Thank you very much