Improving the text extraction efficiency using some OCR

Question

I am very new to Computer Vision. I have lots of images like this:

I want to extract the entire table as text. I tried pytesseract to extract text from the image. I tried the sample code as below:

# Pillow provides the Image class; pytesseract wraps the Tesseract CLI
from PIL import Image
import pytesseract

# Load the screenshot and run English OCR on it
im = Image.open('/home/Downloads/b.png')
text = pytesseract.image_to_string(im, lang='eng')
print(text)

But the results are really bad. A sample:

II) Han H31 Precvsva 111
II) Pegalran Corn m
11) Quama camume. m
15) Sansmlg Eledra. KR
II) snaru Corn/Japan 11>
II) 15 msnlay Co 1111 KR
13)]ah1lC1rcuvl Inc us
II) Iaman Semioan... 1w
I1)Japan msulay Inc 11>
I1) Schneider Fleck... 511
II) campal Elec|ram 111
II) 5111-9110 onlme 5. JP
I1) C1500 syaens Inc us
Is) Warned Semic. 111
II) Mvcran Techmla. us
I1) Camnuler Sclenc
I1) Flex Lid us
I111me1 Corn 115

How can I improve the results? Can I achieve 80-90% accuracy? All my images are in the same format, so can the accuracy be improved for my use case? Any suggestions will help.

Update: I tried using OCR.space, but it didn't work on the following image at all:

zuphilip · Answer 1 · 2017-03-22T08:57:46.440

The main problem with your image is that it is only 96 dpi, while OCR engines often expect around 300 dpi. I changed your image to 300 dpi and resampled it to 200% with IrfanView using the Lanczos algorithm. The same result should be achievable with an ImageMagick convert command.
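For readers who prefer a script over the IrfanView GUI, here is a minimal sketch of the same idea in Python with Pillow. It is not the exact procedure used for this answer. The function names `upscale_for_ocr` and `ocr_upscaled` and the output file name are made up for illustration, and the conversion to RGB is an assumption added so that palette images are really resampled with the Lanczos filter.

```python
from PIL import Image


def upscale_for_ocr(img, factor=2):
    """Return a copy of img enlarged by `factor` using Lanczos resampling."""
    # Palette ("P") and 1-bit images are not filtered properly, so convert first
    rgb = img.convert("RGB")
    new_size = (rgb.width * factor, rgb.height * factor)
    return rgb.resize(new_size, resample=Image.LANCZOS)


def ocr_upscaled(path, out_path="upscaled.png"):
    """Upscale the image at `path`, tag it as 300 dpi, and run Tesseract on it."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed

    big = upscale_for_ocr(Image.open(path))
    # Store the 300 dpi hint in the PNG metadata (hypothetical output file name)
    big.save(out_path, dpi=(300, 300))
    return pytesseract.image_to_string(Image.open(out_path), lang="eng")
```

Calling `ocr_upscaled('/home/Downloads/b.png')` would then return the recognized text for the enlarged copy. Whether a factor of 2 is the best choice for other images is something to check by trial.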

By taking this new image as input for Tesseract the output looks better:

2?) Hon Hai Precisio...
23) Pegatron Corp

2) Quanta Compute...
25) Samsung Electro...
26) Sharp Corp/Japan
27) LG Display Co Ltd
8) Jabil Circuit Inc
M) Taiwan Semicon...
3) Japan Display Inc
31) Schneider Electr...
3) Compal Electroni...
33) GungHo Online E...
X) Cisco Systems Inc
33) Advanced Semic...
%) Micron Technolo...
3) Computer Scienc...
3) Flex Ltd

3) Intel Corp



TW
TW
TW
LG
JP
LG
US
TW
JP
FR
TW
JP
US
TW
US
US
US
US





1.80%
10.40%
-9.50%

2.72%
-0.57%

5.03%

3.90%

1.38%

1.30%

1.33%
-0.13%
-6.21%

0.31%
-3.63%
-0.20%

1.33%
-1.56%

3.91%



53.67%
60.08%
64.85%

5.97%
27.10%
30.28%
24.00%
16.26%
53.70%

0.92%
14.28%
51.70%

0.73%
39.13%
11.00%

6. 7335
2.61%



5.65B
5.078

1 58B
1.808
1.108
1.028
1.278
70.89M
785.20M
177. 44M
90.56M
925.18M
436.70M
89.24M
411.54M
411.06M



cogs
COGS

COGS
[eels
COGS
[eelcis
COGS
CAPEX
COGS
SG&A
CAPEX
cogs
COGS
SG&A
COGS
[eoles



54.66%
16.33%
14.84%

4. 05%
3.65%
3.30%
3.26%
3.23%
3.00%
2.90%
2.85%
2.28%

1. 503;
1.47%
1.42%
142%



#2015A CF
£2015A CF
#2015A CF
Estimate
#2016A CF
Estimate
#2016A CF
Estimate
#2016A CF
Estimate
Estimate
#2015A CF
Estimate
Estimate
#201701 CF
Estimate
Estimate
Estimate



03/30/2016
03/17/2016
03/31/2016
06/10/2016
06/23/2016
02/24/2017
10/20/2016
05/09/2016
06/21/2016
05/27/2016
10/19/2016
03/22/2016
01/03/2017
02/22/2017
01/09/2017
01/03/2017
01/30/2017
01/03/2017

However, the third column is ignored completely here, and some other values are missing as well. Perhaps the layout recognition has some problems...
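One thing that might be worth trying for the layout problem, since all the images share the same format, is to cut the table into column strips and OCR each strip on its own. The sketch below is untested against the original image. `split_columns`, `ocr_columns` and the `x_edges` pixel positions are hypothetical and would have to be measured by hand once. The `--psm 6` option tells Tesseract to assume a single uniform block of text (older Tesseract 3.x releases spell the flag `-psm`).

```python
from PIL import Image


def split_columns(img, x_edges):
    """Crop img into vertical strips between consecutive x positions."""
    strips = []
    for left, right in zip(x_edges, x_edges[1:]):
        strips.append(img.crop((left, 0, right, img.height)))
    return strips


def ocr_columns(path, x_edges):
    """OCR each column strip separately and return one text block per column."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed

    img = Image.open(path)
    # "--psm 6" asks Tesseract to treat each strip as one uniform block of text
    return [
        pytesseract.image_to_string(strip, lang="eng", config="--psm 6")
        for strip in split_columns(img, x_edges)
    ]
```

With this approach each column comes back as its own text block, so a column can no longer be dropped by the page layout analysis, although rows still have to be matched up across columns afterwards.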

Can you please share the code to resample it to 200% with IrfanView using the Lanczos algorithm? I am very new to OpenCV and am not able to figure that out. — Henil Shah, Apr 24 '17 at 18:28
I used the IrfanView GUI on my Windows machine, which provides this functionality under the "Image" menu, cf. http://3.bp.blogspot.com/-AJP_FLR_KQ0/UeLq46oGO9I/AAAAAAAADfY/sJU8GgsJyRs/s1600/IrfanView-Resize.png . (As an alternative, I mentioned ImageMagick. A good resource on that seems to be http://www.imagemagick.org/Usage/filter/nicolas/ .) — zuphilip, Apr 25 '17 at 15:20
