
I have an image file that contains characters and numbers in tabular form. I want to write Python code that recognizes the content of the image and saves it to a text file in the same order as it appears in the image.

Input image file is like this.

EDIT: This is the result after using textcleaner: input file, output file, output text file.

Final Edit: I followed this link for pre-processing the input image, and this is the link to my code, but the results did not improve. What else should I do to get accurate results?
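To make the "same order as in the image" requirement concrete, here is a minimal sketch of one way to rebuild table rows from OCR output. It assumes the OCR step yields one `(text, left, top)` tuple per recognized word (for example, values taken from `pytesseract.image_to_data`). The function name `words_to_table_text`, the `row_tolerance` value, and the sample data are hypothetical and not from the original code.

```python
from typing import List, Tuple

# One recognized word: (text, left, top) in pixels. Assumed input format.
Word = Tuple[str, int, int]


def words_to_table_text(words: List[Word], row_tolerance: int = 10) -> str:
    """Group words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # A word joins the current row if its top edge is close to the
        # top edge of the first word in that row; otherwise it starts a new row.
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Tab-separate the cells so the column structure survives in a text file.
    lines = ["\t".join(w[0] for w in sorted(row, key=lambda w: w[1])) for row in rows]
    return "\n".join(lines)


# Hypothetical word boxes, deliberately out of order.
sample = [("B2", 120, 52), ("A1", 10, 10), ("B1", 118, 12), ("A2", 12, 50)]
table_text = words_to_table_text(sample)
```

The returned string can then be written to a text file with a plain `open(path, "w")`. The right `row_tolerance` depends on the font size and image resolution, so it would probably need tuning for each scan.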

Hasitha Jayawardana
vishal gupta
  • So go ahead. Best of luck. – Rahul Jun 30 '17 at 12:23
  • Try : https://pypi.python.org/pypi/pytesseract – Rahul Jun 30 '17 at 12:25
  • pytesseract is not giving correct results. Recognition is not perfect, and with a noisy background the results are very bad: it prints other characters. Please suggest something else, or tell me how I can remove the background noise properly. I convert the image to grayscale, then apply erosion, dilation, and thresholding (all from cv2). I have also added my code above. – vishal gupta Jun 30 '17 at 12:53
  • There are a lot of OCR solutions available on the internet, I am sure. What kind of approaches have you tried? What worked, what didn't? Why? – Vib Jun 30 '17 at 12:58
  • @Vib please tell me the best OCR for my purpose. – vishal gupta Jun 30 '17 at 13:01
  • Why in God's name does your image look so awful? And it wouldn't hurt if you'd actually ask a question if you want an answer, don't you agree? Please read [ask]. – Piglet Jun 30 '17 at 13:12
  • @Piglet, sorry, I am new to `stackoverflow`. I have gone through `How To ask` and edited my question; please answer it and help. – vishal gupta Jun 30 '17 at 13:28
  • I understand you are new. But I don't think you are going to find anyone here who can give you a link to a perfect library that you can just query and get perfect results for each case. OCR will be noisy. You will need to deal with that noise. You said you already tried tesseract; what kind of results did you get? There will probably not be a perfect solution. I know that Microsoft Cognitive Services' API also has OCR functions. Try that maybe? – Vib Jun 30 '17 at 14:12
  • @Vib, `Microsoft Cognitive Service's API` is not going to help me. I will try to reduce the background noise and try **tesseract** again. – vishal gupta Jul 01 '17 at 05:52
  • Sounds good, good luck! If something works, do update your question and answer it yourself, it might help someone else who searches for something like this. Thanks. – Vib Jul 01 '17 at 06:10
  • I found the `textcleaner` script to reduce noise in the input image, and then passed the result through `tesseract`, but that is also not giving me accurate results. I set these parameter values: **./textcleaner -g -e normalize -f 30 -o 10 -s 1**. This gave somewhat better results, but it still confuses characters like `3 and 8`, `1 and 2`, and `o and 0`, and it does not recognize some characters at all. Can someone please tell me what values I should use with `textcleaner` to get accurate results? – vishal gupta Jul 04 '17 at 11:26
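
The `o and 0` confusion mentioned in the last comment can sometimes be reduced after OCR rather than before it. The following is a minimal sketch, assuming the table cells are already available as strings and that a cell containing at least one digit is meant to be numeric. The helper `fix_numeric_cell` is hypothetical. It does not help with `3 and 8` or `1 and 2`, since both characters in those pairs are valid digits.

```python
import re

# Assumed mapping: letters that should be read as zero inside numeric cells.
O_TO_ZERO = str.maketrans({"o": "0", "O": "0"})


def fix_numeric_cell(cell: str) -> str:
    """Replace 'o'/'O' with '0' only when the cell otherwise looks like a number."""
    # Leave cells without any digit alone, so real words are not altered.
    if not any(ch.isdigit() for ch in cell):
        return cell
    candidate = cell.translate(O_TO_ZERO)
    # Accept the replacement only if the result is a plain integer or decimal.
    if re.fullmatch(r"\d+(\.\d+)?", candidate):
        return candidate
    return cell


# Hypothetical OCR output for one table row.
row = ["Total", "1o5", "3.O", "room2"]
cleaned = [fix_numeric_cell(c) for c in row]
```

Cells such as `room2` stay unchanged because the translated text is still not a plain number, which keeps the rule conservative.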

0 Answers