I am trying to read a newspaper with OCR using Tesseract. Before passing the image to Tesseract, I use Kraken to segment the text lines and draw a line across each sentence so that Tesseract detects them properly. When I pass the image through `kraken.pageseg.segment`, I get an empty list and this output: `Too many connected components for a page image: 5903`. Instead, it should have returned a list containing the coordinates of the bounding boxes around the sentences.

I looked up the Kraken source code and found this particular error message, but I am unable to understand it. [Source code for error][1]

[1]: https://github.com/mittagessen/kraken/blob/master/kraken/pageseg.py#:~:text=connected%20components%20for%20a-,page,-image%3A%20%7Bccs%7D%27)

2 Answers

I had the same problem and solved it after looking at the Kraken API quickstart guide.

Try changing your image binarization. If you were doing binarization with PIL (Pillow), use the kraken binarization method like this:

    from PIL import Image
    from kraken import binarization, pageseg

    # Load the page image, binarize it with kraken, then segment the result
    im = Image.open('foo.png')
    bw_im = binarization.nlbin(im)
    seg_data = pageseg.segment(bw_im)

Reference: https://kraken.re/master/api.html
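
For intuition about the error message, here is a minimal sketch of what "connected components" means, assuming the message refers to the number of separate foreground blobs found in the binarized page. It is plain Python and not kraken's own code; the function name `count_connected_components`, the 8-connectivity choice, and the tiny sample grids are all assumptions made for illustration. A clean binarization gives a few solid blobs, while a noisy one leaves many isolated specks, each counted as its own component.

```python
from collections import deque


def count_connected_components(grid):
    """Count 8-connected groups of foreground (1) pixels in a 2D list.

    Hypothetical helper for illustration; kraken's labeling may differ.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited foreground pixel: flood-fill its whole blob
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Two solid blobs, as a clean binarization might produce
clean = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Isolated specks, as a poorly binarized image might produce
noisy = [
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
]

print(count_connected_components(clean))
print(count_connected_components(noisy))
```

On a full newspaper scan, speckle like the second grid can push the count into the thousands, which would be consistent with the 5903 reported in the question.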

abear

Try downgrading the package to version 2.0.1:

    pip install kraken==2.0.1

I had the same problem with later versions, and downgrading solved it.
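
To confirm which version is actually installed in the active environment after pinning, something like the following sketch may help. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed kraken version, or None if kraken is not installed
print(installed_version("kraken"))
```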