
Using a Form Parser processor to extract tables from a PDF page that is rotated by 90 degrees produces duplicated tables. Printing the bounding boxes shows that the tables are correctly detected and separated, but printing the text content shows the same text for different tables. Manually rotating the same file and processing it again produces the correct table content. Switching from the documentai_v1 client to documentai_v1beta3 doesn't change anything.

  • OS type and version: Windows 10
  • Python version: 3.9.13
  • pip version: 22.0.4
  • google-cloud-documentai version: 1.5.0

Steps to reproduce (PDF sample file: https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf):

  1. Send a request to Document AI (https://cloud.google.com/document-ai/docs/send-request)
  2. Process the output and extract the tables (https://cloud.google.com/document-ai/docs/handle-response#tables)
  3. Rotate the PDF manually and repeat steps 1-2
  4. Compare the output tables: for the second page, the run on the 90-degree rotated page outputs two duplicated tables

Image1: bounding boxes detected (red tables, blue paragraphs). We can see that the two tables are correctly detected

Image2: text extracted from the 90 degrees rotated pdf. We can see that documentai has detected 2 tables but the content is duplicated.

Image3: text extracted from the straight pdf. We can see that documentai has detected 2 tables and the content of the second one is not duplicated.
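For anyone who wants to check the boxes without plotting them, here is a minimal sketch of how a table's bounding box could be read from the response once it is available as a plain dict (for example, the JSON returned by the REST API). It assumes each table carries layout.boundingPoly.normalizedVertices and that zero-valued coordinates may be missing from the JSON; table_bbox is a hypothetical helper name, not something from the client library.

```python
from typing import Tuple


def table_bbox(table: dict) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) in normalized page coordinates."""
    vertices = table["layout"]["boundingPoly"]["normalizedVertices"]
    # Assumption: a coordinate equal to 0 may be omitted from the JSON,
    # so fall back to 0.0 when a key is missing.
    xs = [v.get("x", 0.0) for v in vertices]
    ys = [v.get("y", 0.0) for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))
```

If two tables on the same page return clearly different boxes, the detection step separated them, which is what Image1 shows.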

Edit: I also ran the test using the REST API from a Google VM (following these steps https://www.cloudskillsboost.google/focuses/21028?) and the result is the same: I still get duplicated tables when the page is rotated 90 degrees. You can download the JSON output from the API here: https://drive.google.com/file/d/1jSAr9r8CjxBkw5M97VzogBoWRWRv7gpy/view?usp=sharing

This is the Python code I used to process the JSON:

import json
import pandas as pd

def layout_to_text(layout: dict, text: str) -> str:
    """
    Document AI identifies form fields by their offsets in the entirety of the
    document's text. This function converts offsets to a string.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in layout['textAnchor']['textSegments']:
        # The JSON output may omit startIndex when it is 0, so default to 0.
        start_index = int(segment.get('startIndex', 0))
        end_index = int(segment['endIndex'])
        response += text[start_index:end_index]
    return response

def table_to_df(table, text):
    """Convert a Document AI table object and return a pandas dataframe.

    Keyword arguments:
    table -- Document AI table object
    text -- Document AI text object from the same document as the table
    """

    
    # get column names
    header = []
    for header_cell in table['headerRows'][0]['cells']:
        header_cell_text = layout_to_text(header_cell['layout'], text)
        header.append(header_cell_text.strip())

    # create pandas df
    table_df = pd.DataFrame(columns=header)

    # add rows to df
    for body_row in table['bodyRows']:
        row_content = []
        for body_cell in body_row['cells']:
            body_cell_text = layout_to_text(body_cell['layout'], text)
            row_content.append(body_cell_text.strip())
        table_df.loc[len(table_df)] = row_content

    return table_df



json_path = r"test\sample_tables_rotated.json"
with open(json_path) as f:
    out_dict = json.load(f)
tables_dfs = []
text = out_dict['document']['text']
for page in out_dict['document']['pages']:
    # A page without tables may have no 'tables' key in the JSON.
    for table in page.get('tables', []):
        tables_dfs.append(table_to_df(table, text))

for table_df in tables_dfs:
    print(table_df)
    print("------------------------------------------")
Output:

  Number of Coils Number of Paperclips
0               5              3, 5, 4
1              10                7,8,6
2              15           11, 10, 12
3              20           15, 13, 14
------------------------------------------
   Speed (mph)           Driver                         Car     Engine      Date
0      407.447  Craig Breedlove           Spirit of America     GE J47    8/5/63
1      413.199        Tom Green            Wingfoot Express     WE J46   10/2/64
2       434.22       Art Arfons               Green Monster     GE J79   10/5/64
3      468.719  Craig Breedlove           Spirit of America     GE J79  10/13/64
4      526.277  Craig Breedlove           Spirit of America     GE J79  10/15/65
5      536.712       Art Arfons               Green Monster     GE J79  10/27/65
6      555.127  Craig Breedlove  Spirit of America, Sonic 1     GE J79   11/2/65
7      576.553       Art Arfons               Green Monster     GE J79   11/7/65
8      600.601  Craig Breedlove  Spirit of America, Sonic 1     GE J79  11/15/65
9      622.407    Gary Gabelich                  Blue Flame     Rocket  10/23/70
10     633.468    Richard Noble                    Thrust 2  RR RG 146   10/4/83
11     763.035       Andy Green                  Thrust SSC    RR Spey  10/15/97
------------------------------------------
   Speed (mph)           Driver                         Car     Engine      Date
0      407.447  Craig Breedlove           Spirit of America     GE J47    8/5/63
1      413.199        Tom Green            Wingfoot Express     WE J46   10/2/64
2       434.22       Art Arfons               Green Monster     GE J79   10/5/64
3      468.719  Craig Breedlove           Spirit of America     GE J79  10/13/64
4      526.277  Craig Breedlove           Spirit of America     GE J79  10/15/65
5      536.712       Art Arfons               Green Monster     GE J79  10/27/65
6      555.127  Craig Breedlove  Spirit of America, Sonic 1     GE J79   11/2/65
7      576.553       Art Arfons               Green Monster     GE J79   11/7/65
8      600.601  Craig Breedlove  Spirit of America, Sonic 1     GE J79  11/15/65
9      622.407    Gary Gabelich                  Blue Flame     Rocket  10/23/70
10     633.468    Richard Noble                    Thrust 2  RR RG 146   10/4/83
11     763.035       Andy Green                  Thrust SSC    RR Spey  10/15/97
------------------------------------------
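To narrow down whether the duplication is already present in the API response or is introduced by the parsing code above, one option is to compare the raw text-anchor offsets of the tables on a page. The sketch below does that on the JSON dict. The helper names table_spans and find_duplicate_tables are hypothetical, and it assumes the same field layout the code above relies on (headerRows, bodyRows, cells, layout.textAnchor.textSegments).

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def table_spans(table: dict) -> List[Span]:
    """Collect the (start, end) text offsets of every cell in one table."""
    spans: List[Span] = []
    rows = table.get("headerRows", []) + table.get("bodyRows", [])
    for row in rows:
        for cell in row.get("cells", []):
            anchor = cell.get("layout", {}).get("textAnchor", {})
            for seg in anchor.get("textSegments", []):
                # Assumption: startIndex may be absent when it is 0.
                spans.append((int(seg.get("startIndex", 0)), int(seg["endIndex"])))
    return spans


def find_duplicate_tables(page: dict) -> List[Tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs for tables on a page
    whose cells point at exactly the same text offsets."""
    seen: Dict[tuple, int] = {}
    dupes: List[Tuple[int, int]] = []
    for idx, table in enumerate(page.get("tables", [])):
        key = tuple(table_spans(table))
        if key and key in seen:
            dupes.append((seen[key], idx))
        else:
            seen.setdefault(key, idx)
    return dupes
```

A non-empty result for a page would mean that two tables reference identical offsets in document.text, so the duplication comes from the response rather than from table_to_df.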
  • Hi @Francesco, I was able to reproduce this issue. I can see you have created a GitHub issue: https://github.com/googleapis/python-documentai/issues/381. Please note that the Document AI Engineering team is aware of this issue and further updates will be provided on the GitHub thread. – Sakshi Gatyan Sep 20 '22 at 09:03
  • Can you clarify if this happens only with the Python client library? Or does it happen with any other method? (e.g. REST API, Cloud Console, etc) – Holt Skinner Sep 22 '22 at 17:28
  • @HoltSkinner I ran the test using the REST API and I can confirm the result is the same, I still get duplicated tables – Francesco Pettini Sep 23 '22 at 13:57
