Python 3.6.1, Mac OS X

I have tried many sample scripts found online for PDF -> Text and Image -> Text conversion with Tesseract, and none of them seem to work.

Please let me know if you know of code that works, or of a website with a good tutorial for Tesseract, Poppler, or both.

Pytesser seems outdated. Magick seems to be a Windows-only program. Wand does not seem to help either.

Tesseract-OCR is what I am trying to use, but I have no idea how to write code for it and cannot find a tutorial that works. I can only find install tutorials.

I can use Poppler for PDF -> Text, but I have come across PDFs that contain only images, and I need to extract the text from those. I assume I need one piece of code to turn the PDF into image files and another to turn each image into a text file (Tesseract). Or I could use Poppler's pdfimages, which I do not know how to call from Python (help here would be very much appreciated as well).

My code for Poppler PDF to Text is:

import csv, re, requests, subprocess, sys

url = (
    'http://gwinnetttaxcommissioner.publicaccessnow.com/'
    'Portals/0/PDF/Excess%20funds%20all%20years%20-%20rev02232017.pdf'
)

r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})

filename = url.split('/')[-1].replace('%20', ' ')
with open(filename, 'wb') as fh:
    fh.write(r.content)

subprocess.call(['pdftotext', '-layout', filename])

writer = csv.writer(sys.stdout)
with open(filename[:-3] + 'txt') as fh:
    text = fh.read()
    for line in re.findall(r'(?m)^\d.+\d$', text):
        writer.writerow(re.split(r' {3,}', line))

And it works great.

I cannot figure out how to call Poppler's pdfimages, though.
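For illustration, here is a minimal sketch of how Poppler's image extraction might be driven from Python in the same subprocess style as the code above. It assumes the `pdfimages` binary is on the PATH and is recent enough to support the `-png` flag. The function names `build_pdfimages_cmd` and `extract_images` and the `page` prefix are hypothetical, chosen only for this example.

```python
import glob
import subprocess


def build_pdfimages_cmd(pdf_path, prefix):
    # -png asks pdfimages to write PNG files instead of its default PPM/PBM output.
    # (Assumption: the installed Poppler supports -png.)
    return ['pdfimages', '-png', pdf_path, prefix]


def extract_images(pdf_path, prefix='page'):
    # Run pdfimages, then collect the files it wrote (named prefix-000.png, ...).
    subprocess.check_call(build_pdfimages_cmd(pdf_path, prefix))
    return sorted(glob.glob(prefix + '-*.png'))
```

Something like `extract_images(filename)` would then return the list of image paths to hand to an OCR step. Note that `pdfimages` extracts the images embedded in the PDF as they are stored, not rendered pages.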

Additionally, how would I implement something like this with Tesseract, since it is said to be one of the best OCR engines?
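As a rough sketch of the Image -> Text step, the Tesseract command-line program can be called through subprocess as well. This assumes the `tesseract` binary is installed and on the PATH with the English language data available. The names `build_tesseract_cmd` and `ocr_image` are hypothetical helpers, not part of any library.

```python
import subprocess


def build_tesseract_cmd(image_path, lang='eng'):
    # Passing 'stdout' as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ['tesseract', image_path, 'stdout', '-l', lang]


def ocr_image(image_path, lang='eng'):
    # check_output returns bytes, so decode before treating the result as text.
    raw = subprocess.check_output(build_tesseract_cmd(image_path, lang))
    return raw.decode('utf-8')
```

A call such as `print(ocr_image('page-000.png'))` (the file name is only an example) would print the recognized text, which could then be split into rows the same way as the pdftotext output.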

gmonz
  • @Tienkamp thank you, it has to be automated through Python. I will look into the options you shared after work, but do you know how to work Poppler's PDFImage or Tesseract in Python? – gmonz Apr 05 '17 at 18:23
  • Thanks @Tienkamp I saw that one and tried https://pastebin.com/8sFdgStm but could not figure out where to place the image file/URL?? It does not work when placed in URL. – gmonz Apr 06 '17 at 07:42
  • @Tienkamp I figured out I just need to call the url as a function. Can you possibly help me solve my new question regarding the seemingly unlikely error that was brought up in this post: http://stackoverflow.com/questions/43267807/typeerror-initial-value-must-be-str-or-none-not-bytes-during-pytesseract-con ? Thanks a bunch. – gmonz Apr 07 '17 at 00:38

0 Answers