0

Have 3 pages PDF which has scanned Id card. Id card copy can be on any page I need to blackout Id card number (Format of Id card number - 12 Digits and two spaces i.e xxxx xxxx xxxx)

Please suggest how can i achieve this

I tried microsoft computer vision OCR services but unable to integrate the code

Need to automate this task

Find the Input and expected Output file Input and Outputfile

Tony
  • 52
  • 15

1 Answers1

0

I would locate qrcode (there are several packages of zbar for python) and clean area below it - dimension could be calculated by qr code size...

user898678
  • 2,994
  • 2
  • 18
  • 17
  • I am unable to find QR code since it is a scanned image within a pdf can tesseract ocr can provide me co-ordinates of text with in a pdf – Tony Jun 01 '19 at 15:00
  • Can **tesseract** can give bounding box or coordinates of text elements – Tony Jun 02 '19 at 03:26
  • Yes it can. You can try to extract lined from this (I can not format is code :-( ): import pytesseract pytesseract.pytesseract.tesseract_cmd = r"/path/bin/tesseract" tessdata_dir_config = r'''--tessdata-dir "path/to/tessdata_best/tessdata"''' data = pytesseract.image_to_data(Image.open('test2.jpg'), lang='eng', config=tessdata_dir_config, output_type=pytesseract.Output.DICT) – user898678 Jun 02 '19 at 15:39