Blackout number in pdf using OCR

Question

Have 3 pages PDF which has scanned Id card. Id card copy can be on any page I need to blackout Id card number (Format of Id card number - 12 Digits and two spaces i.e xxxx xxxx xxxx)

Please suggest how can i achieve this

I tried microsoft computer vision OCR services but unable to integrate the code

Need to automate this task

Find the Input and expected Output file Input and Outputfile

score 0 · Answer 1 · answered Jun 01 '19 at 14:23

0

I would locate qrcode (there are several packages of zbar for python) and clean area below it - dimension could be calculated by qr code size...

answered Jun 01 '19 at 14:23

user898678

2,994
2
18
17

I am unable to find QR code since it is a scanned image within a pdf can tesseract ocr can provide me co-ordinates of text with in a pdf – Tony Jun 01 '19 at 15:00
Can **tesseract** can give bounding box or coordinates of text elements – Tony Jun 02 '19 at 03:26
Yes it can. You can try to extract lined from this (I can not format is code :-( ): import pytesseract pytesseract.pytesseract.tesseract_cmd = r"/path/bin/tesseract" tessdata_dir_config = r'''--tessdata-dir "path/to/tessdata_best/tessdata"''' data = pytesseract.image_to_data(Image.open('test2.jpg'), lang='eng', config=tessdata_dir_config, output_type=pytesseract.Output.DICT) – user898678 Jun 02 '19 at 15:39

Blackout number in pdf using OCR

1 Answers1