I have a directory containing pdf files. I have written the code that performs OCR when you pass a filename to an object of the wand.image class. What I want to do presently is to loop over the directory of pdf files and generate a OCR'd txt file for each pdf and save it some directory. The code that I have written till now is as follows:
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
pdf = wi(filename = r"D:\files\aba7d525-04b8-4474-a40d-e94f9656ed42.pdf", resolution = 300)
pdfImg = pdf.convert('jpeg')
imgBlobs = []
for img in pdfImg.sequence:
page = wi(image = img)
imgBlobs.append(page.make_blob('jpeg'))
extracted_text = []
for imgBlob in imgBlobs:
im = Image.open(io.BytesIO(imgBlob))
text = pytesseract.image_to_string(im, lang = 'eng')
extracted_text.append(text)
print(extracted_text[0])
Any suggestions on how I can generate the .txt files from the OCR'd pdf's