I am trying to extract all of the text from a PDF file. The PDFs are fetched from URLs, and they include tables. This code works, but when it reaches a table in the PDF, the table's text is printed column by column instead of row by row, which scrambles my data. Is there a way to have tables read row by row without having to process the tables separately? I still need all of the text from the PDF to come out together. I am using Python.
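import io
import urllib.request

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def getTextFromPDF(url):
    # download the PDF and wrap the bytes in a file-like object
    pdf_bytes = urllib.request.urlopen(url).read()
    memoryFile = io.BytesIO(pdf_bytes)
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with memoryFile as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)
        # collect the text of all pages once the loop is done
        text = fake_file_handle.getvalue()
    # close open handles
    converter.close()
    fake_file_handle.close()
    return text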
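
To make the row-versus-column problem concrete, here is a minimal sketch of one possible direction: group text fragments by their vertical position, then order each group left to right. It is not part of the code above. The `(x0, y0, text)` tuple shape, the function name `group_into_rows`, and the `y_tolerance` value are all assumptions chosen for illustration, and it presumes the fragment coordinates can be obtained from the layout analysis instead of the plain text output.

```python
from typing import List, Tuple

# Hypothetical input shape: (x0, y0, text) for each text fragment on a page.
Fragment = Tuple[float, float, str]

def group_into_rows(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Join fragments that sit at roughly the same height into one line each."""
    rows: List[List[Fragment]] = []
    # PDF coordinates grow upward, so sort by descending y to read top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Compare against the first fragment of the current row.
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the cells left to right.
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda f: f[0]))
        for row in rows
    ]

# Made-up fragments listed column by column, the way the table text comes out now.
cells = [
    (50, 700, "Name"), (50, 680, "Alice"), (50, 660, "Bob"),
    (200, 700, "Age"), (200, 680.5, "30"), (200, 660, "25"),
]
for line in group_into_rows(cells):
    print(line)
```

The tolerance is there because cells in the same table row often differ slightly in their y coordinate; a suitable value likely depends on the font size of the document.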