
I understand that I can ask Tesseract to return text at the word, text line, paragraph, or block level.

I need to form my own cluster of words, which may be a portion of a text line or include multiple lines. Once I have this cluster of words, I'd like to organize them from left-to-right, top-to-bottom for readability.

I assume Tesseract has this ability, since I can get back words in order at the text line level or at the paragraph level. Can I access this method from the tess4j API?

Or can someone point me to the algorithm so I can implement it on my own?

Thanks

Edit: Here's an example. Suppose my image has this block of text:

  John Doe                Adam Paul             Sara Johnson
Vice President         Director of IT      Head of Human Resources
 jdoe@xyz.com           apaul@xyz.com         sjohnson@xyz.com

If I ask tess4j for textline level words, then I get 3 lines:

John Doe Adam Paul Sara Johnson

and

Vice President Director of IT Head of Human Resources

and

jdoe@xyz.com apaul@xyz.com sjohnson@xyz.com

Instead what I want is

John Doe     
Vice President
jdoe@xyz.com

and

Adam Paul
Director of IT
apaul@xyz.com

and

Sara Johnson
Head of Human Resources
sjohnson@xyz.com
kane
  • Have you tried different PSM modes? – nguyenq Jun 02 '17 at 13:05
  • Unfortunately, the segmentation I need is a bit more complex than what I described, and it's not one-size-fits-all. Sometimes I need a whole paragraph, and other times only the first sentence of the paragraph, so I have a special algorithm that clusters my words. I just needed something to display them in a human-readable way. I posted an answer which works reasonably well – kane Jun 02 '17 at 19:32

1 Answer


I wrote my own algorithm which sorts the words. The basic idea is a Comparator that orders words top-to-bottom and left-to-right (specific to English-style reading order, of course).

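I use the bottom edge of the word for comparing because it should be about the same for words of different sizes, while the top edge may be higher for bigger words. Note that java.awt.Rectangle uses image coordinates, where y grows downward, so the bottom edge is getMaxY() and the top edge is getMinY().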

I also allow for some margin of error in the y-axis comparison because the image could be tilted slightly, or the OCR may draw a bounding box slightly offset; i.e., words on the same line may sit a little higher or lower than their neighbors.

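import java.awt.Rectangle;
import java.util.Comparator;
import net.sourceforge.tess4j.Word;

// Orders words top-to-bottom, then left-to-right within a line.
Comparator<Word> readingOrder = new Comparator<Word>() {
  @Override
  public int compare(Word w1, Word w2) {
    Rectangle b1 = w1.getBoundingBox();
    Rectangle b2 = w2.getBoundingBox();

    // Image coordinates: y grows downward, so getMaxY() is the bottom edge.
    double yDiff = Math.abs(b1.getMaxY() - b2.getMaxY());
    // Tolerance of half the smaller word height; using both boxes keeps
    // compare(a, b) consistent with compare(b, a).
    double margin = Math.min(b1.getHeight(), b2.getHeight()) / 2.0;
    if (yDiff < margin) {
      // Same line: order left to right.
      return Double.compare(b1.getMinX(), b2.getMinX());
    }
    // Different lines: order top to bottom.
    return Double.compare(b1.getMaxY(), b2.getMaxY());
  }
};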
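
One caveat with any tolerance-based comparator: it is not strictly transitive (A can be "on the same line" as B, and B as C, without A being on the same line as C), and List.sort can throw "Comparison method violates its general contract" in that case. A possible alternative, sketched below under assumptions, is to first sort by the bottom edge with an exact comparison, then split the words into lines using the same half-height tolerance, and finally sort each line by x. Box, toLines and toText are hypothetical names for illustration; Box stands in for a tess4j Word so the example only needs the JDK.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrder {

    /** Hypothetical stand-in for an OCR word: its text plus bounding box. */
    static final class Box {
        final String text;
        final Rectangle bounds;

        Box(String text, int x, int y, int width, int height) {
            this.text = text;
            this.bounds = new Rectangle(x, y, width, height);
        }
    }

    /** Groups boxes into lines by bottom edge, then sorts each line left to right. */
    static List<List<Box>> toLines(List<Box> words) {
        List<Box> byBottom = new ArrayList<>(words);
        // Exact total order on the bottom edge (y grows downward in image coordinates).
        byBottom.sort(Comparator.comparingDouble((Box b) -> b.bounds.getMaxY()));

        List<List<Box>> lines = new ArrayList<>();
        List<Box> current = new ArrayList<>();
        for (Box word : byBottom) {
            if (!current.isEmpty()) {
                // Compare against the first word of the current line.
                Box anchor = current.get(0);
                double tolerance =
                        Math.min(anchor.bounds.getHeight(), word.bounds.getHeight()) / 2.0;
                if (Math.abs(word.bounds.getMaxY() - anchor.bounds.getMaxY()) >= tolerance) {
                    // Too far below the anchor: close this line and start a new one.
                    lines.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(word);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        for (List<Box> line : lines) {
            line.sort(Comparator.comparingDouble((Box b) -> b.bounds.getMinX()));
        }
        return lines;
    }

    /** Joins the lines into readable text, one row per detected line. */
    static String toText(List<Box> words) {
        StringBuilder sb = new StringBuilder();
        for (List<Box> line : toLines(words)) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(line.get(i).text);
            }
        }
        return sb.toString();
    }
}
```

With real tess4j output, each Box would be built from a Word's getText() and getBoundingBox(). The tolerance is measured against the first word of each line, which is a design choice of this sketch; a strongly tilted image might need a running average or deskewing first.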
kane