
I am using Ubuntu.

Here is an image that I got from the internet.

My goal is to extract the data exactly as it is formatted in the image

and dump it into a text file, with the positions maintained (95-97% accuracy).

I am working with tesseract-ocr

[Image 1]

[Image 2]

An almost identical question has been asked here.

My code:

import cv2
import pytesseract
from pytesseract import Output
import numpy as np

img = cv2.imread("/demo.jpg")

d1 = pytesseract.image_to_data(img)

print(d1)

The output is completely different from what I expect.

In short, I want to convert this image (with its alignment preserved) to a text file (or a CSV file).

Vadim Kotov
jony

2 Answers

You can use Tesseract's hOCR output to retain positional information. Converting this kind of document directly into text while retaining positional information is a hard problem. I can offer an intermediate solution: a data frame with each word and its coordinates, which you can then parse to extract key-value information using the coordinates.

import pytesseract

# This saves the Tesseract output as "demo.hocr"
pytesseract.pytesseract.run_tesseract(
    "demo.jpg", "demo",
    extension='.html', lang='eng', config="hocr")

hOCR is an HTML-like representation that contains a lot of metadata, such as line information, word information, and coordinates. For easier handling, I have written a parser that reads it directly and gives you a data frame with the words and their coordinates. I have published it as a pip package called tesseract2dict, which you can install with pip install tesseract2dict. This is how you can use it:

import cv2
from tesseract2dict import TessToDict

td = TessToDict()
inputImage = cv2.imread('path/to/image.jpg')

# Function 1: get word-level information as a data frame
word_dict = td.tess2dict(inputImage, 'outputName', 'outfolder')

# Function 2: get the plain text inside a given region (x, y, w, h);
# here the region covers the whole image
text_plain = td.word2text(word_dict, (0, 0, inputImage.shape[1], inputImage.shape[0]))

PS: This package is only compatible with Tesseract 5.0.0
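
If installing an extra package is not an option, the word boxes can also be read from the hOCR file with the standard library alone. The following is a minimal sketch, not the tesseract2dict implementation: it assumes the usual hOCR shape in which each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The names HocrWordParser, parse_hocr_words, and SAMPLE_HOCR are hypothetical, and the sample markup is hand-written for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: word boxes appear in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects the text and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            x0, y0, x1, y1 = self._bbox
            self.words.append({
                "text": "".join(self._text).strip(),
                "left": x0,
                "top": y0,
                "width": x1 - x0,
                "height": y1 - y0,
            })
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of word dicts (text, left, top, width, height)."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample in the assumed hOCR shape (not real Tesseract output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 300 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 91'>No.</span>
 </span>
</div>
"""

for word in parse_hocr_words(SAMPLE_HOCR):
    print(word)
```

For a real file, read the contents of demo.hocr and pass them to parse_hocr_words. The resulting list of dicts can be loaded into a data frame if pandas is available.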

Sreekiran A R

You can use pytesseract's parameters to achieve what you're looking for.
More specifically, the Output class you imported lists all the output types that pytesseract supports:

import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("/demo.jpg")

# My favorite type is Output.DICT, but since you mentioned CSV,
# Output.DATAFRAME (which requires pandas) is more convenient
d1 = pytesseract.image_to_data(img, output_type=Output.DATAFRAME)

print(type(d1))
d1.to_csv('ocr_dump.csv')
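
Since the question asks for the alignment to be kept in a text file, the word coordinates can be mapped onto a character grid. The sketch below is one possible approach under stated assumptions, not a pytesseract feature: it takes a dict shaped like the result of image_to_data(img, output_type=Output.DICT) (lists under the keys "text", "left", "top", "height"), buckets words into rows by their vertical center, and pads each word to a column derived from its left coordinate. The function name layout_text, the parameters char_width and line_height, and the sample data are hypothetical and would need tuning to the font size of the actual image.

```python
from collections import defaultdict


def layout_text(data, char_width=10, line_height=20):
    """Rebuild roughly aligned plain text from word boxes.

    data: dict of equal-length lists with keys "text", "left", "top", "height"
          (the shape of pytesseract's Output.DICT result).
    char_width, line_height: assumed pixel size of one character cell.
    """
    rows = defaultdict(list)
    for text, left, top, height in zip(
        data["text"], data["left"], data["top"], data["height"]
    ):
        text = str(text).strip()
        if not text:
            continue  # structural rows (page, block, line) carry no text
        # Bucket by vertical center so words on one visual line share a row.
        # Simplification: a line that straddles a bucket boundary may split.
        row = int((top + height / 2) // line_height)
        rows[row].append((left, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for left, text in sorted(rows[row]):
            # Target column from the pixel position; never overlap the
            # previous word, and keep at least one space between words.
            col = max(left // char_width, len(line) + 1 if line else 0)
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


# Hand-made sample in the assumed shape (not real OCR output).
sample = {
    "text": ["", "Name", "Qty", "Apple", "3"],
    "left": [0, 20, 200, 20, 200],
    "top": [0, 10, 12, 40, 41],
    "height": [100, 14, 14, 14, 14],
}

print(layout_text(sample))
```

With real data, pass the dict returned by image_to_data with output_type=Output.DICT and write the returned string to a text file. Empty rows between lines are not reproduced in this sketch.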
Karol Żak