
I have different types of invoice files, and I want to find the table in each of them. The table position is not constant, so I went with image processing: first I convert the invoice into an image, then I find contours based on the table borders, and finally I can get the table position. I used the code below for this task.

import cv2
import numpy as np
from wand.image import Image  # wand (ImageMagick binding), used to rasterize the PDF page

with Image(page) as page_image:
    page_image.alpha_channel = False  # eliminates transparency
    img_buffer = np.asarray(bytearray(page_image.make_blob()), dtype=np.uint8)
    img = cv2.imdecode(img_buffer, cv2.IMREAD_UNCHANGED)

    ret, thresh = cv2.threshold(img, 127, 255, 0)
    # OpenCV 3.x signature; in OpenCV 2.x/4.x findContours returns only (contours, hierarchy)
    im2, contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    margin = []
    for contour in contours:
        # get the rectangle bounding the contour
        [x, y, w, h] = cv2.boundingRect(contour)
        # skip small false positives that aren't tables
        if w > thresh1 and h > thresh2:
            margin.append([x, y, x + w, y + h])
    # data cleanup on margin to extract the required position values

In this code, I update `thresh1` and `thresh2` based on the file.

Using this code I can successfully read the positions of tables in images, and with those positions I then work on my invoice PDF file. For example:

Sample 1: [input invoice image]

Sample 2: [input invoice image]

Sample 3: [input invoice image]

Output:

Sample 1: [detected table position]

Sample 2: [detected table position]

Sample 3: [detected table position]

But now I have a new format that doesn't have any borders, yet it is still a table. How can I solve this? My entire operation depends on the borders of the tables, and now there are no borders. I have no idea how to get past this problem. My question is: is there any way to find the table position based on the table structure alone?

For example, my problem input looks like this: [borderless table image]

I would like to find its position like this: [expected table position image]

How can I solve this? Any idea for solving the problem would be greatly appreciated.

Thanks in advance.

Mohamed Thasin ah
  • Do all tables have the same format? Should the program detect the address as a table? – qwr Jun 13 '18 at 06:07
  • @qwr - No, it should not detect the address as a table. It should detect only table-like structures; more precisely, it should detect records that contain more than one column. – Mohamed Thasin ah Jun 13 '18 at 06:17
  • If you have sample images of every type of input that you'll get, then your best bet would be to train a neural network. For inspiration, look at [this video](https://redd.it/8p9car). – zindarod Jun 13 '18 at 08:06
  • @zindarod - Thanks for your valuable comment. I was thinking along those lines: if simple image processing doesn't help, then I'll move to ML as you've suggested. Thanks again for the playing-card detection video; it's really cool. – Mohamed Thasin ah Jun 13 '18 at 08:12

4 Answers

71

Vaibhav is right. You can experiment with different morphological transforms to extract or group pixels into different shapes, lines, etc. For example, the approach can be the following:

  1. Start with dilation to convert the text into solid spots.
  2. Then apply the findContours function to find the text bounding boxes.
  3. Once you have the text bounding boxes, apply some heuristic algorithm to cluster them into groups by their coordinates. This way you can find groups of text areas aligned into rows and columns.
  4. Then sort by x and y coordinates and/or apply some analysis to the groups to determine whether the grouped text boxes can form a table.

I wrote a small sample illustrating the idea. I hope the code is self-explanatory; I've put some comments there too.

import os
import cv2

# This only works if there's only one table on a page
# Important parameters:
#  - morph_size
#  - min_text_height_limit
#  - max_text_height_limit
#  - cell_threshold
#  - min_columns


def pre_process_image(img, save_in_file, morph_size=(8, 8)):

    # get rid of the color
    pre = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu threshold
    pre = cv2.threshold(pre, 250, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # dilate the text to make it solid spot
    cpy = pre.copy()
    struct = cv2.getStructuringElement(cv2.MORPH_RECT, morph_size)
    cpy = cv2.dilate(~cpy, struct, anchor=(-1, -1), iterations=1)
    pre = ~cpy

    if save_in_file is not None:
        cv2.imwrite(save_in_file, pre)
    return pre


def find_text_boxes(pre, min_text_height_limit=6, max_text_height_limit=40):
    # Looking for the text spots contours
    # OpenCV 3
    # img, contours, hierarchy = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    # OpenCV 4
    contours, hierarchy = cv2.findContours(pre, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    # Getting the texts bounding boxes based on the text size assumptions
    boxes = []
    for contour in contours:
        box = cv2.boundingRect(contour)
        h = box[3]

        if min_text_height_limit < h < max_text_height_limit:
            boxes.append(box)

    return boxes


def find_table_in_boxes(boxes, cell_threshold=10, min_columns=2):
    rows = {}
    cols = {}

    # Clustering the bounding boxes by their positions
    for box in boxes:
        (x, y, w, h) = box
        col_key = x // cell_threshold
        row_key = y // cell_threshold
        cols[col_key] = [box] if col_key not in cols else cols[col_key] + [box]
        rows[row_key] = [box] if row_key not in rows else rows[row_key] + [box]

    # Filtering out the clusters having less than 2 cols
    table_cells = list(filter(lambda r: len(r) >= min_columns, rows.values()))
    # Sorting the row cells by x coord
    table_cells = [list(sorted(tb)) for tb in table_cells]
    # Sorting rows by the y coord
    table_cells = list(sorted(table_cells, key=lambda r: r[0][1]))

    return table_cells


def build_lines(table_cells):
    if table_cells is None or len(table_cells) <= 0:
        return [], []

    max_last_col_width_row = max(table_cells, key=lambda b: b[-1][2])
    max_x = max_last_col_width_row[-1][0] + max_last_col_width_row[-1][2]

    max_last_row_height_box = max(table_cells[-1], key=lambda b: b[3])
    max_y = max_last_row_height_box[1] + max_last_row_height_box[3]

    hor_lines = []
    ver_lines = []

    for box in table_cells:
        x = box[0][0]
        y = box[0][1]
        hor_lines.append((x, y, max_x, y))

    for box in table_cells[0]:
        x = box[0]
        y = box[1]
        ver_lines.append((x, y, x, max_y))

    (x, y, w, h) = table_cells[0][-1]
    ver_lines.append((max_x, y, max_x, max_y))
    (x, y, w, h) = table_cells[0][0]
    hor_lines.append((x, max_y, max_x, max_y))

    return hor_lines, ver_lines


if __name__ == "__main__":
    in_file = os.path.join("data", "page.jpg")
    pre_file = os.path.join("data", "pre.png")
    out_file = os.path.join("data", "out.png")

    img = cv2.imread(in_file)

    pre_processed = pre_process_image(img, pre_file)
    text_boxes = find_text_boxes(pre_processed)
    cells = find_table_in_boxes(text_boxes)
    hor_lines, ver_lines = build_lines(cells)

    # Visualize the result
    vis = img.copy()

    # for box in text_boxes:
    #     (x, y, w, h) = box
    #     cv2.rectangle(vis, (x, y), (x + w - 2, y + h - 2), (0, 255, 0), 1)

    for line in hor_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)

    for line in ver_lines:
        [x1, y1, x2, y2] = line
        cv2.line(vis, (x1, y1), (x2, y2), (0, 0, 255), 1)

    cv2.imwrite(out_file, vis)

I've got the following output:

[image: sample table extraction result]

Of course, to make the algorithm more robust and applicable to a variety of different input images, it has to be adjusted correspondingly.

Update: I updated the code with respect to the OpenCV API changes for findContours. If you have an older version of OpenCV installed, use the corresponding call; a version-agnostic sketch follows. Related post.
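Since the position of the contours in the returned tuple is the only thing that changes between versions, a common version-agnostic idiom is to take the second-to-last element of whatever findContours returns. A minimal sketch (the helper name is illustrative, not part of the original code):

def find_contours_compat(binary_img):
    # OpenCV 2.x and 4.x return (contours, hierarchy); OpenCV 3.x returns
    # (image, contours, hierarchy). The contours are always second-to-last.
    result = cv2.findContours(binary_img, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    return result[-2]

In find_text_boxes you would then call contours = find_contours_compat(pre) regardless of the installed OpenCV version.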

Dmytro
  • Thanks for your detailed approach, it's really appreciated. – Mohamed Thasin ah Aug 09 '18 at 05:06
  • Awesome, man, I found the code very useful. Is there any way to extract the exact text from the table as well, and identify the data in each cell of the table? – Sundeep Pidugu Nov 30 '18 at 05:57
  • @Sundeep Of course that's possible. Having the table cell coordinates, it's quite easy to extract the cell image from the original image and feed it to an OCR engine. I use the Google Tesseract OCR engine for that. There's a Python lib `pytesseract` allowing you to invoke the Tesseract executable from Python (a minimal sketch follows this comment thread). – Dmytro Dec 03 '18 at 18:17
  • @Dmytro I'm trying to use your code, but it throws an error.. I'm not familiar with Python so don't really know what each line does.. Hope you could help :) https://stackoverflow.com/questions/54137624/new-to-python-unorderable-types-dict-dict – clarkk Jan 10 '19 at 22:19
  • Hey @clarkk, sorry for the late response. The link to your question gives a 404. If you're still experiencing errors, could you post the error message and the stack trace here? – Dmytro Jan 27 '19 at 12:01
  • I'm so impressed this works. Thank you. I had a little trouble with changes to the findContours API as it changed in OpenCV 3 and again in OpenCV 4. Also I don't understand the purpose of "contours = contours[0] if imutils.is_cv2() else contours[1]" and I had to comment this out to get the code to work, which seems wrong. Can anyone explain or point me at something to read? – north at graphviz Jul 09 '19 at 20:11
  • The purpose of "contours = contours[0] if imutils.is_cv2() else contours[1]" is just for compatibility between versions of OpenCV 2 and 3 because the findContours function signature has changed. In OpenCV 2 it returned just an array of contours and in OpenCV 3 it returns a tuple of image, contours array and hierarchy. If you're sure that you always use the same version of OpenCV you can rewrite it accordingly. For OpenCV 3 it would be: image, contours, hierarchy = cv2.findContours(...) – Dmytro Jul 22 '19 at 08:15
  • @Dmytro Is this code still working? Because I got the error `cv2.error: OpenCV(4.1.0) /io/opencv/modules/imgproc/src/shapedescr.cpp:743: error: (-215:Assertion failed) npoints >= 0 && (depth == CV_32F || depth == CV_32S) in function 'pointSetBoundingRect'` (inside a cv2 function) when the program calls `box = cv2.boundingRect(contour)` from `text_boxes = find_text_boxes(pre_processed)` – L F Aug 19 '19 at 15:18
  • @LuisFelipe, the moment I wrote my reply, OpenCV 4 was not released yet, so I worked with OpenCV 3. I'll check my code with OpenCV 4 and will add an update later. – Dmytro Aug 20 '19 at 09:25
  • Going back to OpenCV 3 was the easiest solution. – L F Aug 20 '19 at 17:00
  • @Dmytro this works like a charm. But I am trying to make it work for other images as well where like you said, adjustments need to be made accordingly. Would be really helpful if you could give a heads up on how to try making it more generic to fit in other tabular structures as well. – Subigya Upadhyay Oct 17 '19 at 08:10
  • @SubigyaUpadhyay Thanks for the question. It's hard for me to come up with generalization strategy without knowing what the other special cases are. Could you provide the samples you're dealing with? In any case, general approach is to add more dynamics: for instance, converting the constant thresholds and limits to functions from, let's say, input image size... or instead of making strict assumptions about text size - try to detect a table with a range of text size limits. – Dmytro Oct 18 '19 at 13:36
  • @Dmytro Thanks for the suggestion. I will look into it. Meanwhile, here is a sample image. https://ibb.co/9v25mvj – Subigya Upadhyay Oct 24 '19 at 12:29
  • @SubigyaUpadhyay Sorry for the late response. It's clear to me that in your sample the rows and columns have different alignment, and the algorithm I provided in the answer assumes top-left text alignment only - that's why it doesn't work here. I'll provide a universal solution later if you're still interested. – Dmytro Nov 25 '19 at 16:45
  • @Dmytro Thanks for the response and yes, it'd be great if you could provide universal solutions. – Subigya Upadhyay Dec 03 '19 at 16:35
  • If the table in the scanned image has dotted borders this code draws a rectangle around all the dots making it as a rectangle-boxed border. Any idea how to make the dotted border into a continuous straight lined border ? – JKC Feb 15 '20 at 18:23
  • @JKC I'd use different strategy in this case: connect the border dots together using erode/dilate and then detect the table by it's borders like I described here https://stackoverflow.com/questions/57210148/what-is-the-best-way-to-extract-text-contained-within-a-table-in-a-pdf-using-pyt/57664735#57664735 – Dmytro Feb 16 '20 at 15:21
  • @Ani I'm half-way there. My original intention was to refactor this sample code to make it well structured and organized, add alignment detection, and create a repo on GitHub. I have refactored the code and added horizontal alignment detection so far, but put it on hold because I don't have enough spare time to work on this at the moment. I'll get back to it as soon as I'm available. Sorry. – Dmytro Mar 27 '20 at 13:27
  • @Dmytro What are `in_file`, `pre.png` and `out.png` here? – Aditya sharma Jun 21 '21 at 15:05
  • @Adityasharma The `in_file` is the input file containing the table you want to be recognized. `pre.png` isn't actually required; it's just for debugging, to see intermediate results. `out.png` is the file where the detected table boundaries are drawn. If you're only interested in the table cell coordinates or the text inside your table, then you don't need `out.png`. Feel free to modify the code for your needs, because I provided it only as a proof of concept. – Dmytro Jun 25 '21 at 12:13
  • @Dmytro Your logic works fine with documents where the text is spaced properly, but wherever there are compact tables, the code fails. Also, can I use your logic/code in my model? I'm trying to improve on this and try to modify it so that it can also work for multiple tables. – Piyush Shandilya Sep 03 '21 at 13:42
  • @PiyushShandilya If the text is more compact, the variables in the comments (`morph_size`, `min_text_height_limit`, `max_text_height_limit`, `cell_threshold`, `min_columns`) can be tweaked to improve accuracy. They can even be dynamically adjusted somehow. The code I provided is just a snippet illustrating one possible approach; you can use it any way you want and improve it. – Dmytro Sep 05 '21 at 14:09
  • Did your model have good accuracy? Can you share your implementation? I'm trying to accomplish the same thing and modify this code to suit my needs. @PiyushShandilya – Lidor Eliyahu Shelef Nov 21 '22 at 07:31
  • @LidorEliyahuShelef The model worked for documents with well-spaced text (as in the images above) but fails for cases where the text is much more compact. The model ends up "identifying" a lot of mini-tables. – Piyush Shandilya Nov 30 '22 at 10:57
  • @PiyushShandilya Is there a way, or maybe a GitHub link, that I'll be able to check? – Lidor Eliyahu Shelef Nov 30 '22 at 11:53
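Regarding the OCR follow-up in the comments above: a minimal sketch of feeding the detected cells to Tesseract via pytesseract, assuming the table_cells structure returned by find_table_in_boxes (rows of (x, y, w, h) boxes; note these are text bounding boxes, not full cell extents) and that Tesseract itself is installed:

import cv2
import pytesseract

def ocr_table_cells(img, table_cells):
    # Crop each detected box from the page image and OCR it separately
    table_text = []
    for row in table_cells:
        row_text = []
        for (x, y, w, h) in row:
            cell_img = img[y:y + h, x:x + w]
            row_text.append(pytesseract.image_to_string(cell_img).strip())
        table_text.append(row_text)
    return table_text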
12

You can try applying some pre-processing (such as a Gaussian blur) and morphological transforms (such as dilation and erosion) before your findContours call.

For example

import cv2
import numpy as np

# g is the grayscale input image, e.g.:
# g = cv2.cvtColor(cv2.imread('invoice.png'), cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(g, (3, 3), 0)
ret, thresh1 = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY)
bitwise = cv2.bitwise_not(thresh1)
erosion = cv2.erode(bitwise, np.ones((1, 1), np.uint8), iterations=5)
dilation = cv2.dilate(erosion, np.ones((3, 3), np.uint8), iterations=5)

The last argument, iterations, controls the degree of dilation/erosion that will take place (in your case, on the text). A small value will result in small independent contours even within a single letter, while large values will club many nearby elements together. You need to find the ideal value so that only the table block of your image gets grouped.

Please note that I've taken 150 as the threshold parameter because I've been working on extracting text from images with varying backgrounds, and this worked out better. You can choose to continue with the value you've been using, since it's a black & white image.
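To complete the pipeline, the dilated image would then go to findContours and the resulting bounding boxes get filtered by size, as in the question's code. A minimal continuation, assuming the dilation image from the snippet above (the size limits are illustrative, not values from this answer):

# Contours are the second-to-last element of the returned tuple in all OpenCV versions
contours = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w > 50 and h > 20:  # keep only blocks big enough to be grouped text
        print("candidate block:", x, y, w, h)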

  • Thanks for your answer. I can't use contours now, because I don't have borders in the table; my task is to find a table-like structure :( – Mohamed Thasin ah Jun 13 '18 at 06:27
  • @MohamedThasinah Contours don't have to work on a border. They will work on the text too, and they can use the text layout as a reference to make a box around it. – Vaibhav Mehrotra Jun 13 '18 at 06:35
  • Yeah, I agree with you, but when I find contours on a borderless image it works only on the text, i.e., contour 1 is DATE, contour 2 is 1/02/04, etc., but I want all of those in the same contour. Anyway, I'll try to follow your answer and update you as soon as possible. – Mohamed Thasin ah Jun 13 '18 at 06:39
  • @MohamedThasinah If you increase the number of iterations to, say, 8, 10 or 12, it will start grouping close-by contours. I'll be waiting for an update :D – Vaibhav Mehrotra Jun 13 '18 at 07:26
  • I tried what you said, but it tries to club with non-table data :( – Mohamed Thasin ah Jun 13 '18 at 07:33
9

There are many types of tables in document images, with too many variations and layouts. No matter how many rules you write, a table will always appear for which your rules fail. These types of problems are generally solved with ML (Machine Learning) based solutions. You can find many pre-implemented codebases on GitHub for detecting tables in images using ML or DL (Deep Learning).

Here is my code along with the deep learning models; the model can detect various types of tables as well as the structure cells of the tables: https://github.com/DevashishPrasad/CascadeTabNet

The approach achieves state-of-the-art accuracy on various public datasets as of now (10 May 2020).

More details: https://arxiv.org/abs/2004.12629

Devashish Prasad
4

This would be helpful for you. I've drawn a bounding box for each word in my invoice; then I choose only the fields that I want. You can use an ROI (Region of Interest) for that.

import cv2
import pytesseract

img = cv2.imread(r'path\Invoice2.png')
# image_to_data returns word-level bounding boxes along with layout metadata
d = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
n_boxes = len(d['level'])
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imshow('img', img)
cv2.waitKey(0)

You will get this output: [bounding box drawn for each field]
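If you only want the word entries (image_to_data also returns page, block, paragraph and line rows), a minimal filtering sketch, assuming the same d dictionary as above; the confidence cutoff is a choice for illustration, not part of the answer:

for i in range(len(d['text'])):
    # Layout rows (page/block/para/line) carry conf == -1 and empty text
    if d['text'][i].strip() and float(d['conf'][i]) > 0:
        print(d['text'][i], d['left'][i], d['top'][i], d['width'][i], d['height'][i])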

Fahd Zaghdoudi