How to get confidence of each line using pytesseract

Question

I have successfully set up Tesseract and can translate the images to text...

text = pytesseract.image_to_string(Image.open(image))

However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?

I know this is possible using PyTessBaseAPI, but I cannot use that: I've spent hours attempting to set it up with no luck. So I need a way to do this using pytesseract.

Getting the predicted text for all predictions, not just the top predicted would be a huge plus for me. — Michael Higgins, Apr 07 '22 at 10:55

score 23 · Accepted Answer · answered Mar 29 '19 at 01:42

After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...

text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')

So I saved the output as a dataframe and then used pandas to group by block_num, since OCR groups each line into a block. I also removed all rows with no confidence value (-1)...

text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)

Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...

conf = text.groupby(['block_num'])['conf'].mean()
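To try the grouping above without an image or a Tesseract install, here is a minimal sketch on a small hand-made dataframe. The rows are invented stand-ins for what image_to_data returns (only three of its columns are included), and the names data, words and summary are made up for illustration.

```python
import pandas as pd

# Invented rows that mimic three columns of the image_to_data dataframe.
# Rows with conf == -1 stand for non-word entries (blocks, lines, ...).
data = pd.DataFrame({
    "block_num": [1, 1, 1, 2, 2],
    "conf": [-1, 90, 80, -1, 70],
    "text": [None, "Hello", "world", None, "Bye"],
})

# Keep only rows that carry a real confidence value.
words = data[data.conf != -1]

# Join the words of each block and average their confidences.
block_text = words.groupby("block_num")["text"].apply(" ".join)
block_conf = words.groupby("block_num")["conf"].mean()

summary = pd.DataFrame({"text": block_text, "conf": block_conf})
print(summary)
```

Note that the mean here is per block. It matches a per-line confidence only when each block happens to hold a single line.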

Sandipan Dey · Answer 2 · 2021-04-27T05:32:26.400

@Srikar Appalaraju is right. Take the following example image:

Now use the following code:

text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()

Notice that all five rows have the same block_num, so if we group by that column alone, all 5 words (texts) will be grouped together. That is not what we want: we want to group only the first 3 words, which belong to the first line. To do that properly (in a generic manner) for a large enough image, we need to group by all 4 columns page_num, block_num, par_num and line_num simultaneously in order to compute the confidence for the first line, as shown in the following code snippet:

keys = ['page_num', 'block_num', 'par_num', 'line_num']
lines = text.groupby(keys)['text'].apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(keys)['conf'].mean().tolist()

# keep only non-blank lines, each paired with its mean word confidence
line_conf = []
for line, conf in zip(lines, confs):
    if line.strip():
        line_conf.append((line, round(conf, 3)))

with the following desired output:

[('Ying Thai Kitchen', 91.667),
 ('2220 Queen Anne AVE N', 88.2),
 ('Seattle WA 98109', 90.333),
 ('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
 ('‘uw .yingthaikitchen.com', 40.0),
 ('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
 ('Order#:17 Table 2', 94.0),
 ('Date: 7/4/2013 7:28 PM', 86.25),
 ('Server: Jack (1.4)', 83.0),
 ('44 Ginger Lover $9.50', 89.0),
 ('[Pork] [24#]', 43.0),
 ('Brown Rice $2.00', 95.333),
 ('Total 2 iten(s) $11.50', 89.5),
 ('Sales Tax $1.09', 95.667),
 ('Grand Total $12.59', 95.0),
 ('Tip Guide', 95.0),
 ('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
 ('Thank you very much,', 90.75),
 ('Cone back again', 92.667)]

Is there a similar way to get the bounding boxes of each line? — RandomPersonOnline, Jul 25 '22 at 05:47
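One possible approach to the bounding-box question, as a rough sketch rather than a tested recipe: the image_to_data output also has left, top, width and height columns for each word, so the same four-column grouping can take the union of the word boxes on each line. The rows below are invented for illustration, and the names words and boxes are hypothetical.

```python
import pandas as pd

# Invented word rows using the column names of the image_to_data dataframe.
words = pd.DataFrame({
    "page_num":  [1, 1, 1],
    "block_num": [1, 1, 1],
    "par_num":   [1, 1, 1],
    "line_num":  [1, 1, 2],
    "left":      [10, 60, 10],
    "top":       [12, 10, 40],
    "width":     [40, 50, 30],
    "height":    [15, 20, 15],
    "text":      ["Ying", "Thai", "Seattle"],
})

keys = ["page_num", "block_num", "par_num", "line_num"]

# A line's box is the union of its word boxes: smallest left/top,
# largest right/bottom edge.
boxes = (
    words.assign(right=words.left + words.width,
                 bottom=words.top + words.height)
    .groupby(keys)
    .agg(left=("left", "min"), top=("top", "min"),
         right=("right", "max"), bottom=("bottom", "max"),
         text=("text", " ".join))
    .reset_index()
)
boxes["width"] = boxes.right - boxes.left
boxes["height"] = boxes.bottom - boxes.top
print(boxes[["text", "left", "top", "width", "height"]])
```

On real output, the rows with conf == -1 would need to be filtered out first, as in the answer above, so that only word boxes take part in the union.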

Srikar Appalaraju · Answer 3 · 2021-04-30T20:01:38.227

The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is

text.groupby(['block_num','par_num','line_num'])['text'].apply(list)

We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?

Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: Word number of the detected text or item

But all 4 of the above columns are interconnected. If an item comes from a new line, the word number starts counting again rather than continuing from the last word number of the previous line. The same goes for line_num, par_num and block_num.
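To see why the restarting counters matter, here is a small sketch on invented rows: two blocks that each contain their own line 1. Grouping by line_num alone merges the two lines, while grouping by the full key keeps them apart. The data and variable names are made up for illustration.

```python
import pandas as pd

# Invented word rows: two different blocks, each with a line numbered 1.
words = pd.DataFrame({
    "page_num":  [1, 1, 1, 1],
    "block_num": [1, 1, 2, 2],
    "par_num":   [1, 1, 1, 1],
    "line_num":  [1, 1, 1, 1],
    "word_num":  [1, 2, 1, 2],
    "text":      ["Tip", "Guide", "Thank", "you"],
})

# line_num alone is ambiguous because it restarts in every paragraph/block.
by_line_only = words.groupby("line_num")["text"].apply(" ".join).tolist()

# The full key identifies each physical line uniquely.
keys = ["page_num", "block_num", "par_num", "line_num"]
by_full_key = words.groupby(keys)["text"].apply(" ".join).tolist()

print(by_line_only)  # ['Tip Guide Thank you']
print(by_full_key)   # ['Tip Guide', 'Thank you']
```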
