
I am writing a script that takes scanned PDF files and converts them into lines of text to enter into a database. I use re.findall with a list of regular expressions to pull certain values out of the strings extracted by Tesseract. My problem is that when a regular expression finds no match, I want "Error" added to the results so I can see that something went wrong.

I have tried a handful of if/else statements but I can't seem to get any to notice the None value.

from wand.image import Image as Img
import ghostscript
from PIL import Image
import pytesseract
import re
import os

def get_text_from_pdf(pendingpdf,pendingimg):
    with Img(filename=pendingpdf, resolution=300) as img:
        img.compression_quality = 99
        img.save(filename=pendingimg)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    extractedtext = pytesseract.image_to_string(Image.open(pendingimg))
    os.unlink(pendingimg)
    return extractedtext

def get_results(vendor,extracted_string,results):
    for v in vendor:
        pattern = re.compile(v)
        for match in re.findall(pattern,extracted_string):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results

pendingpdf = r'J:\TBHscan07022019090315001.pdf'
pendingimg = 'Test1.jpg'
# Raw strings keep regex escapes such as \w and \d from being read as string escapes.
aggind = [r"^(\w+)(?:.+)\n+3600",
          r"Ticket: (nonsensewordstothrowerror)",
          r"Ticket: \d+\s([0-9|/]+)",
          r"Product: (\w+.+)\n",
          r"Quantity: ([\d\.]+)",
          r"Truck (\w+)"]
vendor = aggind
extracted_string = get_text_from_pdf(pendingpdf,pendingimg)
results = []

# Reuse the text extracted above instead of running OCR a second time.
print(get_results(vendor,extracted_string,results))
locke14
Matthew Keith

4 Answers

You could do this in a single line:

results += re.findall(pattern, extracted_string) or ["Error"]

BTW, you get no benefit from compiling the pattern inside the vendor loop because you're only using it once.

Your function could also return the whole search result using a single list comprehension:

return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]

It is a bit weird that you would actually want to modify AND return the results list being passed as parameter. This may produce some unexpected side effects when you use the function.
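
A minimal sketch of that side effect, assuming a simplified variant of the function (the name get_results_mutating and the sample strings are made up for illustration). Because the caller's list is extended in place and then returned, every call with the same list hands back the same object:

```python
import re

def get_results_mutating(patterns, text, results):
    # Hypothetical variant: extends the caller's list in place and returns it.
    for p in patterns:
        results += re.findall(p, text) or ["Error"]
    return results

shared = []
first = get_results_mutating([r"Truck (\w+)"], "Truck A12", shared)
second = get_results_mutating([r"Truck (\w+)"], "Truck B34", shared)

# Both names refer to the same list object, so the first result has changed too.
print(first is second)  # True
print(first)            # ['A12', 'B34']
```

Building and returning a fresh list inside the function avoids this kind of accumulation.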

Your "Error" flag may appear several times in the result list, and given that each pattern may return multiple matches, it will be hard to determine which pattern failed to find a value.
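
One possible way around that ambiguity is to key the results by pattern. This is only a sketch: the helper name get_results_by_pattern, the sample text, and the choice of None as the "nothing found" marker are all assumptions made for illustration, not part of the original script.

```python
import re

def get_results_by_pattern(patterns, text):
    # Hypothetical helper: map each pattern to its matches, or None if it found nothing.
    return {p: (re.findall(p, text) or None) for p in patterns}

# Made-up sample text standing in for the OCR output.
sample_text = "Ticket: 123 07/02/2019\nQuantity: 21.5\n"
sample_patterns = [r"Ticket: \d+\s([0-9/]+)", r"Quantity: ([\d.]+)", r"Truck (\w+)"]

by_pattern = get_results_by_pattern(sample_patterns, sample_text)
failed = [p for p, found in by_pattern.items() if found is None]
print(failed)  # the patterns that found nothing
```

With this shape, a failed pattern can be reported by name instead of appearing as an anonymous "Error" somewhere in a flat list.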

If you only want to signal an error when none of the vendor patterns match, you could use the or ["Error"] trick on the whole result:

return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]
Alain T.
  • This works beautifully. I am basically trying to scan a whole bunch of proofs of delivery and enter them into our accounts payable system, which is basically a spreadsheet at this point. Anyway, thank you so much! – Matthew Keith Jul 02 '19 at 16:54

With a loop such as for match in re.findall(pattern,extracted_string): the body of the for loop never runs when re.findall(...) finds no matches, so the else branch is never reached.

Save the result of re.findall in a variable first, then check whether it is empty:

...
matches = re.findall(pattern, extracted_string)
if not matches:
    results.append("Error")
else:
    for match in matches:
        results.append(match)
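
Put together as a complete function, the check might look like the sketch below. Note the assumptions: this version builds and returns a new list rather than taking a results argument as the original get_results does, and the sample input is made up.

```python
import re

def get_results(patterns, text):
    # Returns a new list: all matches per pattern, or "Error" where a pattern found nothing.
    results = []
    for p in patterns:
        matches = re.findall(p, text)
        if not matches:
            results.append("Error")
        else:
            results.extend(matches)
    return results

demo = get_results([r"Quantity: ([\d.]+)", r"Truck (\w+)"], "Quantity: 21.5")
print(demo)  # ['21.5', 'Error']
```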

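Note that when iterating over the results of re.findall(...), the check if type(match) is str: adds nothing here: each of your patterns has a single capturing group, so every matched item is already a string. (With two or more capturing groups, re.findall returns tuples instead.) If the intent was to validate the matched text itself, a more detailed check of the string's content would be needed.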
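
One caveat worth checking for yourself: re.findall returns plain strings only when the pattern has zero or one capturing group; with two or more groups it returns a list of tuples. A short demonstration on a made-up sample string:

```python
import re

text = "Ticket: 123 07/02/2019"

one_group = re.findall(r"Ticket: (\d+)", text)
two_groups = re.findall(r"Ticket: (\d+)\s([0-9/]+)", text)

print(one_group)   # ['123']
print(two_groups)  # [('123', '07/02/2019')]
```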

RomanPerekhrest

re.findall returns an empty list when there are no matches. So it should be as simple as:

result = re.findall(my_pattern, my_text)
if result:
    # Successful logic here
else:
    return "Error"
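
A quick way to confirm this behaviour, using a made-up sample string. An empty list is falsy, which is why the plain if result: test works; re.search is shown only for contrast, since it is the function that returns None on failure:

```python
import re

text = "Quantity: 21.5"

no_matches = re.findall(r"Truck (\w+)", text)
print(no_matches)        # []
print(bool(no_matches))  # False

# re.search, by contrast, returns None when nothing matches.
no_search = re.search(r"Truck (\w+)", text)
print(no_search)         # None
```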
Alex

You have

for match in re.findall(pattern,extracted_string):
    if type(match) is str:
        results.append(match)
    else:
        results.append("Error")

but re.findall() returns an empty list (not None) when it doesn't find anything, so

for match in re.findall(pattern,extracted_string):

has nothing to iterate over and the loop body never runs.

You need to check whether the list is empty outside of the for loop.

JRotelli