
I am trying to figure out how to distinguish a character or digit from a simple line, so that I can filter out irrelevant input before passing it to Tesseract OCR. My idea is to use connectedComponentsWithStats to get the bounding box around each component and then count the white and black pixels inside that box. By thresholding the black/white (BW) ratio, I want to find the mostly filled boxes, which are the lines I want to filter out.

My input is a large set of images, each containing a single rotated letter/character or line. I can rotate them using the minimum-area rectangle, but unfortunately I can't crop them. Do you have any hints, or maybe a better idea, for checking the BW ratio in my rotated box?

Rotated component Character

more details

    import cv2
    import numpy as np

    analysis_of_single_groups = cv2.connectedComponentsWithStats(rotated_without_box, 4, cv2.CV_32S)
    (totalLabels_s_g, label_ids_s_g, values_s_g, centroid_s_g) = analysis_of_single_groups

    for i in range(1, totalLabels_s_g):  # label 0 is the background
        x = values_s_g[i, cv2.CC_STAT_LEFT]
        y = values_s_g[i, cv2.CC_STAT_TOP]
        w = values_s_g[i, cv2.CC_STAT_WIDTH]
        h = values_s_g[i, cv2.CC_STAT_HEIGHT]

        print("x: " + str(x))

        # axis-aligned crop of the component's bounding box
        crop_img = rotated_without_box[y:y + h, x:x + w].copy()
        cv2.imwrite("ta/cropped_" + str(i) + ".png", crop_img)

        number_of_black_pix = np.sum(crop_img == 0)    # black (background) pixels
        number_of_white_pix = np.sum(crop_img == 255)  # white (foreground) pixels
        bw_ratio = number_of_black_pix / number_of_white_pix
        is_line = bw_ratio < 0.9  # a mostly filled box is treated as a line
  • Not quite clear what you want to do. But what about trying OCR on a region of interest, rotated four ways, and keeping the best read? – Jul 25 '22 at 18:37

1 Answer

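Use cv2.findContours, keep only the top-level contours via the hierarchy, then filter by the side ratio (long side / short side) of each contour's minimum-area rectangle: a large ratio means a line, a small ratio means a symbol.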

in out

Image 2: out2

import cv2
import numpy as np

gray = cv2.imread("/Users/alex/Downloads/dwojD_2.png", cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# https://docs.opencv.org/4.x/d9/d8b/tutorial_py_contours_hierarchy.html
contours, hierarchy = cv2.findContours(image=thresh, mode=cv2.RETR_TREE, method=cv2.CHAIN_APPROX_NONE)
hierarchy = hierarchy[0]

bgr = cv2.cvtColor(thresh, cv2.COLOR_GRAY2BGR)
for i in range(len(hierarchy)):
    if hierarchy[i][3] >= 0:
        continue # skip contours that have a parent (nested contours, e.g. holes)
    
    rRect = cv2.minAreaRect(contours[i])
    size = rRect[1]
    if min(size) == 0: 
        continue
    ratio = max(size) / min(size)
    print("Min rect size", size, "; Ratio", ratio)

    # you can filter contours by width and height ratio
    isSymbol = ratio < 3
    color = (0, 255, 0) if isSymbol else (0, 0, 255)

    if isSymbol: print("> Symbol!")
    else: print("> Line!")

    
    # cv2.drawContours(image=bgr, contours=contours, contourIdx=i, color=(0, 255, 0), thickness=2, lineType=cv2.LINE_AA)
    box = cv2.boxPoints(rRect).astype(np.int32)  # np.int0 was removed in NumPy 2.0
    cv2.drawContours(image=bgr, contours=[box], contourIdx=0, color=color, thickness=2, lineType=cv2.LINE_AA)
    

cv2.imshow("img", bgr)
cv2.waitKey()
Gralex