
I need to read the text from some images. The images are clear and have very little noise, so my original thought was that it should be pretty easy to extract the text (little did I know).

I tested some Python libraries (pytesser) without much success; they would get maybe 10% right. I turned to Google's tesseract-ocr, but it is still far from good.

Here is one example: [image]

and below is the result:

nemnamons

Ill
w_on

lhggerllo
' 59
' as

\M_P2ma\

vuu uu

Cafllode omer
Mom | Dyna
Mom | Dyna

lnggerllo



2vMnne= Tr2rspnn| Factory (Hexmy;

lalgeflll Uxzlconflg
w_o«
w_o«

cammem

What am I doing wrong? Or is OCR really this bad?

theAlse
  • I believe you need to process (crop?) your image before you OCR it. It won't be able to fetch data directly from a table. If you crop or remove the table border, you could get somewhat decent output. Also, in my experience, adjusting the size of the image can give better results. – Chris Aung May 06 '14 at 12:14
  • @ChrisAung, that is the actual picture. I need to do this automatically, so I cannot modify each picture to remove the borders and so on... – theAlse May 06 '14 at 16:48
  • This link might be helpful: http://stackoverflow.com/questions/6173439/can-ocr-software-reliably-read-values-from-a-table – Chris Aung May 07 '14 at 00:54

1 Answer

You will need to pre-process the image, for example by removing noise, to get a better result. You can then use a library such as pytesseract to extract the text from the image:

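import cv2
import numpy as np
import pytesseract
from PIL import Image


def get_string(img_path):
    # Load the image and convert it to grayscale
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove some noise: dilate, then erode (a morphological closing).
    # Note: a 1x1 kernel leaves the image unchanged; try a larger one,
    # e.g. (2, 2), if the image is actually noisy.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    cv2.imwrite("removed_noise.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open("removed_noise.png"))

    return result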
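
The comments on the question suggest removing the table borders and resizing the image before running OCR. Here is a minimal sketch of that idea using only NumPy. It assumes dark text on a light background and perfectly horizontal and vertical ruled lines. The function names and the `dark_threshold` and `line_fraction` values are illustrative assumptions, not tuned settings, and would probably need adjusting for real screenshots.

```python
import numpy as np


def remove_table_lines(gray, dark_threshold=128, line_fraction=0.5):
    """Paint over rows and columns that are mostly dark (likely table borders).

    gray: 2-D uint8 array, dark text on a light background (an assumption).
    """
    cleaned = gray.copy()
    dark = gray < dark_threshold
    # A ruled line is dark across most of a row or column; text usually is not.
    border_rows = dark.mean(axis=1) > line_fraction
    border_cols = dark.mean(axis=0) > line_fraction
    cleaned[border_rows, :] = 255
    cleaned[:, border_cols] = 255
    return cleaned


def upscale(gray, factor=2):
    """Enlarge the image by an integer factor using nearest-neighbour repetition."""
    return np.repeat(np.repeat(gray, factor, axis=0), factor, axis=1)


# Tiny synthetic example: white page, one horizontal and one vertical border,
# plus a short dark stroke standing in for text.
page = np.full((20, 30), 255, dtype=np.uint8)
page[5, :] = 0        # horizontal table line
page[:, 10] = 0       # vertical table line
page[12, 15:19] = 0   # "text" stroke, too short to count as a line

prepared = upscale(remove_table_lines(page), factor=2)
```

The resulting array could then be handed to pytesseract, for instance via `Image.fromarray(prepared)`. Any text pixels that lie exactly on a removed row or column are erased too, so this is a rough starting point and not a robust table parser.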
Snow