
Recently, I took on a project: converting a scanned PDF to a searchable PDF/Word document using Python and Tesseract.

After a few attempts, I was able to convert the scanned PDF to PNG image files, but now I'm stuck. Could anyone please help me convert the PNG files to a searchable Word/PDF document? My code is below.

Please find the attached image for reference.

    import os
    import sys

    lib_path = r'_______'   # site-packages
    pop_path = r'_______'   # Poppler DLLs
    sys.path.insert(0, lib_path)

    from PIL import Image
    import pytesseract
    from pdf2image import convert_from_path

    pdf_path = r'_______'   # PDF file directory
    img_path = r'_______'   # image output path

    # Convert each page of the PDF to a PNG image
    images = convert_from_path(pdf_path=pdf_path,
                               dpi=500, poppler_path=pop_path)
    for idx, pg in enumerate(images):
        # Include the page index in the name so pages do not overwrite each other
        pg.save(os.path.join(img_path, 'PDF_Page_{}.png'.format(idx)), "PNG")
        print('{} page converted'.format(idx))

    def ocr_core(image_file):
        # Run Tesseract on a single image file and return the recognized text
        text = pytesseract.image_to_string(Image.open(image_file))
        return text

    print(ocr_core("image path/imagename"))

That's all I've written. I then got multiple ".PNG" images, but I am only able to convert one PNG image to text.

How can I convert all the images and save the text as CSV/Word?

Dale K
Deepak
  • This has been answered in a different question; see https://stackoverflow.com/q/58627249/12273437 – Deepak Nov 06 '19 at 10:19

1 Answer

    from PIL import Image
    from pdf2image import convert_from_path
    import pytesseract

    pdf_file_path = '_______'  # your file path

    # Convert every page of the PDF to an image
    images = convert_from_path(pdf_file_path, dpi=500)

    # Save each page as image_1.jpg, image_2.jpg, ...
    counter = 1
    for page in images:
        idx = "image_" + str(counter) + ".jpg"  # or ".png"
        page.save(idx, 'JPEG')
        counter = counter + 1

    file = counter - 1  # number of pages saved
    output = '_____'  # where you want to save and file name
    f = open(output, "w")
    # OCR each saved page in order and write its text to the output file
    for i in range(1, file + 1):
        idx = "image_" + str(i) + ".jpg"  # or ".png"
        text = str(pytesseract.image_to_string(Image.open(idx)))
        f.write(text)
    f.close()  # close once, after all pages are written
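
If you want CSV rather than a plain text file, the same loop can write one row per page image. This is only a rough sketch: the names `images_to_csv` and `tesseract_ocr`, the `ocr` parameter, and the `*.png` pattern are my own placeholders, and it assumes the page images are already saved in one folder. The OCR step is passed in as a function so the file handling can be tried without Tesseract installed.

```python
import csv
import glob
import os


def tesseract_ocr(image_path):
    """Default OCR step; needs pytesseract, Pillow and a Tesseract install."""
    # Imported lazily so the CSV logic below can run without these packages.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path))


def images_to_csv(image_dir, csv_path, ocr=tesseract_ocr, pattern="*.png"):
    """OCR every image in image_dir matching pattern; write one CSV row per image.

    Returns the number of images processed.
    """
    # sorted() is lexicographic, so zero-pad page numbers in the file names
    # (PDF_Page_001.png, PDF_Page_002.png, ...) if page order matters.
    image_paths = sorted(glob.glob(os.path.join(image_dir, pattern)))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text"])  # header row
        for path in image_paths:
            writer.writerow([os.path.basename(path), ocr(path)])
    return len(image_paths)
```

Calling `images_to_csv(img_path, "ocr_output.csv")` would then process every PNG in the folder, where `img_path` is your image output folder and the CSV name is just an example. For a searchable PDF, pytesseract also has `image_to_pdf_or_hocr(image, extension='pdf')`, which returns the PDF bytes for one page; Word output would need a separate library such as python-docx.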
Deepak