This is my code. So far, it prints all of the content of the PDFs when I print the pages variable. However, I can't get the function to return that same extracted text. I've been testing it by dropping random PDFs into the folder I'm scanning. How do I get it to return the extracted text the same way it prints it?
import os
import PyPDF2
import pandas as pd

def scan_files(root):
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path, name))
                numPages = pdf.getNumPages()
                for p in range(0, numPages):
                    pages = ''
                    page = pdf.getPage(p)
                    pages += page.extractText()
                    pages = pages.replace('\n', '')
                    #print(pages)
                    return pages
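
For comparison, here is a minimal sketch of the accumulate-then-return pattern, which may be what is missing: `pages` is reset once per file rather than once per page, and the function returns only after the walk has finished. It assumes the goal is one string per PDF, keyed by file path. The `extract_pages` parameter and the `pypdf2_pages` helper are hypothetical names introduced for illustration. The helper uses the same legacy PyPDF2 calls as the code above, so it assumes a PyPDF2 version that still provides them.

```python
import os
from typing import Callable, Dict, Iterable


def scan_files(root: str,
               extract_pages: Callable[[str], Iterable[str]]) -> Dict[str, str]:
    """Walk root and return {pdf_path: text} for every .pdf file found.

    extract_pages is a hypothetical hook: given a file path, it yields
    the text of each page in order.
    """
    results = {}
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                full_path = os.path.join(path, name)
                pages = ''  # reset once per file, not once per page
                for page_text in extract_pages(full_path):
                    pages += page_text.replace('\n', '')
                results[full_path] = pages
    return results  # return only after every file has been visited


def pypdf2_pages(pdf_path):
    """Hypothetical extractor built on the legacy PyPDF2 calls used above."""
    import PyPDF2  # imported here so scan_files itself does not need PyPDF2
    reader = PyPDF2.PdfFileReader(pdf_path)
    for p in range(reader.getNumPages()):
        yield reader.getPage(p).extractText()
```

Calling `scan_files('my_folder', pypdf2_pages)` (with `'my_folder'` standing in for the real folder) would then return a dictionary holding the text of every PDF, instead of stopping at the first page of the first file.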