Understanding DetectedBreak in Google OCR full-text annotations

Question

I am trying to convert the full-text annotation of a Google Vision OCR result, which is organized in a Block, Paragraph, Word, and Symbol hierarchy, into line-level and word-level text.

However, when converting symbols to word text and word to line text, I need to understand the DetectedBreak property.

I went through this documentation, but I did not understand a few of the break types.

Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.

EOL_SURE_SPACE
HYPHEN
LINE_BREAK
SPACE
SURE_SPACE
UNKNOWN

Can they be replaced by either a newline character or a space?

The google cloud services have some of the laziest, poorly written docs I've seen out of any software company. The problem repeats itself over and over again with each new google cloud service I try and use. There must be something structurally wrong with Google that would enable this disease to spread from team to team, although I won't speculate further. — John Miller, Apr 19 '23 at 19:32

score 2 · Accepted Answer · answered Nov 02 '18 at 15:56

The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see in the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image stored in GCS and prints every detected break except the two you already understand (LINE_BREAK and SPACE), along with the word immediately preceding each break so you can compare.

import sys
import os
# Note: vision.types and vision.enums below come from google-cloud-vision
# releases before 2.0.
from google.cloud import vision

def detect_breaks(gcs_image):

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    image_source = vision.types.ImageSource(
        image_uri=gcs_image)

    image = vision.types.Image(
        source=image_source)

    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)

    annotation = client.annotate_image(request).full_text_annotation

    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
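                        break_type = symbol.property.detected_break.type
                        # UNKNOWN is 0 (falsy), so symbols with no detected break are skipped.
                        if break_type:
                            # Report every break other than SPACE and LINE_BREAK,
                            # together with the word that precedes it.
                            if break_type not in (breaks.SPACE, breaks.LINE_BREAK):
                                print(word_text, symbol.property.detected_break)
                            # Any detected break ends the current word.
                            word_text = ""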

if __name__ == '__main__':
    detect_breaks(sys.argv[1])

The problem is generating input for the 6 types of `Break`, except for `SPACE` and `LINE_BREAK`. And that is exactly my question: what do those 6 `DETECTED_BREAK` values mean? — Arun Gowda, Nov 03 '18 at 07:30
Specifically `EOL_SURE_SPACE` and `HYPHEN`. Right now, I am treating them as new lines. — Arun Gowda, Nov 03 '18 at 07:31
`EOL_SURE_SPACE` is basically a big EOL (end of line), and `HYPHEN` is the case where a word has to be broken in the middle, with a hyphen (`-`) at the end of the line. Check [this](https://fontshopblog.files.wordpress.com/2013/07/hyphen-4.png) for `HYPHEN` and [this](https://jeroen.github.io/images/testocr.png) for `EOL_SURE_SPACE`. — Lefteris S, Nov 05 '18 at 08:48
So to simply answer my question, is it safe to say that `SPACE` and `SURE_SPACE` can be treated as `SPACE`, and the rest of them as `NEW_LINE`? — Arun Gowda, Nov 05 '18 at 16:21
I guess that's up to you and your implementation. If you're willing to ignore the hyphen before the line ends or the fact that sure space is larger than a "standard" space then yes, it's safe. — Lefteris S, Nov 06 '18 at 08:52
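
The mapping discussed in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: `BREAK_TO_SEPARATOR`, `join_symbols`, and `split_lines` are hypothetical names, the break types are passed as plain strings so the sketch runs without the Vision client, and the separator chosen for each break follows the comment thread rather than any official guidance. With a real response you would first look up the name of each symbol's `detected_break.type`.

```python
# Hypothetical mapping from break type names to the text placed after a symbol.
# These choices follow the comment thread; adjust them for your own use case.
BREAK_TO_SEPARATOR = {
    "UNKNOWN": "",            # no break follows the symbol
    "SPACE": " ",
    "SURE_SPACE": " ",        # wider than a regular space, collapsed to one here
    "EOL_SURE_SPACE": "\n",   # treated as the end of a line
    "HYPHEN": "\n",           # end-of-line hyphen, treated as the end of a line
    "LINE_BREAK": "\n",
}


def join_symbols(symbols, separators=BREAK_TO_SEPARATOR):
    """Rebuild text from (symbol_text, break_name) pairs."""
    parts = []
    for text, break_name in symbols:
        parts.append(text)
        # Unrecognized break names fall back to "no separator".
        parts.append(separators.get(break_name, ""))
    return "".join(parts)


def split_lines(symbols, separators=BREAK_TO_SEPARATOR):
    """Return the rebuilt text as a list of lines."""
    return join_symbols(symbols, separators).splitlines()


if __name__ == "__main__":
    sample = [
        ("H", "UNKNOWN"), ("i", "SPACE"),
        ("y", "UNKNOWN"), ("o", "EOL_SURE_SPACE"),
        ("o", "UNKNOWN"), ("k", "LINE_BREAK"),
    ]
    print(split_lines(sample))
```

If you would rather keep a visible hyphen at the end of a broken line, passing a copy of the mapping with `"HYPHEN": "-\n"` is one way to do it.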
