
I am wondering whether an OCR tool such as pytesseract can be used to automate covering up text in an image. I know that pytesseract provides image_to_boxes(), which returns a bounding box for each recognized character. However, I do not want to cover every character, only the necessary ones (i.e. those that are part of sensitive information). To find these, I can run a regex search on the image_to_string() result, as below.

ocr_result = pytesseract.image_to_string(Image.open(my_pic))
list(set(re.findall(my_regex, ocr_result)))

However, with image_to_boxes() I am not able to find the corresponding characters, since each output line corresponds to a single character. For example, the character 'a' occurs multiple times in the image, and I have no idea how to tell which 'a' belongs to the matched text. Below is an example of image_to_boxes() output.

p 1404 1762 1417 1803 0
a 1404 1762 1424 1795 0
...

Is there a way to map the image_to_boxes() output onto the image_to_string() result to get the right character locations?
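One possible direction, sketched here rather than tested against real OCR output: since every image_to_boxes() line holds one character, the characters can be joined into a single whitespace-free string, and the position of a regex match in that string then indexes directly into the list of boxes. The helper names `parse_boxes` and `find_match_boxes` and the sample text are hypothetical, and the sketch assumes each line has the form `char left bottom right top page`. The regex has to be written for text without spaces, because the box output contains none.

```python
import re


def parse_boxes(boxes_text):
    """Split image_to_boxes()-style text into a string and a parallel box list.

    Assumes each line looks like "char left bottom right top page".
    Lines that do not fit that shape (e.g. "...") are skipped.
    """
    chars, boxes = [], []
    for line in boxes_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue
        symbol, *numbers = parts
        try:
            left, bottom, right, top, _page = map(int, numbers)
        except ValueError:
            continue
        # Repeat the box for every character of the symbol so that
        # string indices and box indices stay aligned.
        for ch in symbol:
            chars.append(ch)
            boxes.append((left, bottom, right, top))
    return "".join(chars), boxes


def find_match_boxes(pattern, boxes_text):
    """Return (matched_text, [box, ...]) for every regex match."""
    text, boxes = parse_boxes(boxes_text)
    return [
        (m.group(0), boxes[m.start():m.end()])
        for m in re.finditer(pattern, text)
    ]


# Made-up sample in the same format as the output shown above.
sample = (
    "p 1404 1762 1417 1803 0\n"
    "a 1404 1762 1424 1795 0\n"
    "s 1426 1762 1440 1795 0\n"
    "s 1442 1762 1456 1795 0\n"
)
matches = find_match_boxes(r"ss", sample)
```

With this, the second 'a' in a document is distinguished from the first simply by its index in the joined string, so no per-character lookup is needed.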

What I am trying to achieve is to automate covering the parts of the text that contain sensitive information with a black box. Has anyone done this before? Any help would be appreciated.
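For the black box itself, a minimal sketch of the coordinate handling, assuming the per-character boxes are `(left, bottom, right, top)` tuples measured from the bottom-left corner of the image, as in Tesseract's box format. The function name `to_redaction_rect` and the `padding` parameter are hypothetical. The function merges the boxes of one match into a single rectangle and flips the y axis so the result uses a top-left origin, which is what Pillow's drawing functions expect.

```python
def to_redaction_rect(char_boxes, image_height, padding=2):
    """Merge character boxes into one (x0, y0, x1, y1) rectangle.

    char_boxes: iterable of (left, bottom, right, top), bottom-left origin.
    image_height: height of the image in pixels, used to flip the y axis.
    Note: a match that spans several text lines would yield one large
    rectangle; splitting per line is not handled in this sketch.
    """
    char_boxes = list(char_boxes)
    left = min(b[0] for b in char_boxes) - padding
    bottom = min(b[1] for b in char_boxes) - padding
    right = max(b[2] for b in char_boxes) + padding
    top = max(b[3] for b in char_boxes) + padding
    # Flip y: a high "top" value in bottom-left coordinates is a small
    # y value in top-left coordinates.
    y0 = image_height - top
    y1 = image_height - bottom
    return (max(left, 0), max(y0, 0), right, min(y1, image_height))


# The two boxes from the example output above, on an assumed
# 2000-pixel-high image.
rect = to_redaction_rect(
    [(1404, 1762, 1417, 1803), (1404, 1762, 1424, 1795)],
    image_height=2000,
    padding=0,
)
```

The resulting tuple could then be passed to Pillow, e.g. `ImageDraw.Draw(img).rectangle(rect, fill="black")`, with `img.height` supplying `image_height`.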

Darren Christopher
  • You could create a regex that works on the `image_to_boxes` output like `'^s[ 0-9]+$e[ 0-9]+$c[ 0-9]+$r[ 0-9]+$'` and process all matches further (which contain the positions). – Michael Butscher Apr 22 '19 at 23:59
  • Yes, but the thing is the `image_to_boxes()` output has no whitespaces, which may affect the regex search result on the sensitive info. – Darren Christopher Apr 23 '19 at 00:02
