Using a Form Parser processor, extracting tables from a PDF page that is rotated by 90 degrees produces duplicated tables. Printing the bounding boxes shows that the tables are correctly detected and separated, but printing the text content returns the same text for different tables. Manually rotating the same file upright and processing it again produces the correct table content. Switching from the documentai_v1 client to documentai_v1beta3 doesn't change anything.
- OS type and version: Windows 10
- Python version: 3.9.13
- pip version: 22.0.4
- google-cloud-documentai version: 1.5.0
PDF sample file: https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf
Steps to reproduce:
- Send a request to Document AI (https://cloud.google.com/document-ai/docs/send-request)
- Process the output and extract the tables (https://cloud.google.com/document-ai/docs/handle-response#tables)
- Rotate the PDF manually so the page is upright and repeat the two steps above
- Compare the output tables: for the original (rotated) file, the second page yields two tables with duplicated content, while the manually rotated file yields the correct content
Image1: detected bounding boxes (tables in red, paragraphs in blue). The two tables are correctly detected.
Image2: text extracted from the PDF rotated by 90 degrees. Document AI has detected 2 tables, but the content is duplicated.
Image3: text extracted from the upright PDF. Document AI has detected 2 tables, and the content of the second one is not duplicated.
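For anyone who wants to reproduce the bounding-box check without the images, here is a minimal sketch. It assumes the processor response has been loaded as a plain dict from the JSON output and that each table carries its box under `layout.boundingPoly.normalizedVertices`. The function name `table_bounding_boxes` is made up for this example and is not part of the client library.

```python
def table_bounding_boxes(document: dict) -> list:
    """Return (page_index, table_index, vertices) for every table in the document.

    `document` is assumed to be the 'document' object of the JSON response.
    """
    boxes = []
    for page_index, page in enumerate(document.get("pages", [])):
        # Pages without tables may not have a 'tables' key at all.
        for table_index, table in enumerate(page.get("tables", [])):
            poly = table.get("layout", {}).get("boundingPoly", {})
            # Zero-valued coordinates can be omitted from the JSON,
            # so default both x and y to 0.0.
            vertices = [
                (float(v.get("x", 0.0)), float(v.get("y", 0.0)))
                for v in poly.get("normalizedVertices", [])
            ]
            boxes.append((page_index, table_index, vertices))
    return boxes
```

If two tables on the same page report different vertices here but print the same text, the detection step is probably fine and the problem is in the text anchors.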
Edit: I also ran the test using the REST API from a Google VM (following these steps: https://www.cloudskillsboost.google/focuses/21028?) and the result is the same: I still get duplicated tables when the page is rotated 90 degrees. You can download the JSON output from the API here: https://drive.google.com/file/d/1jSAr9r8CjxBkw5M97VzogBoWRWRv7gpy/view?usp=sharing
This is the Python code I used to process the JSON:
import json
import pandas as pd
def layout_to_text(layout: dict, text: str) -> str:
    """
    Document AI identifies form fields by their offsets in the entirety of the
    document's text. This function converts offsets to a string.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in layout['textAnchor']['textSegments']:
        # startIndex is omitted from the JSON when it is 0, so default to 0.
        start_index = int(segment.get('startIndex', 0))
        end_index = int(segment['endIndex'])
        response += text[start_index:end_index]
    return response
def table_to_df(table, text):
    """Convert a Document AI table object to a pandas DataFrame.

    Keyword arguments:
    table -- Document AI table object
    text -- Document AI text object from the same document as the table
    """
    # get column names
    header = []
    for header_cell in table['headerRows'][0]['cells']:
        header_cell_text = layout_to_text(header_cell['layout'], text)
        header.append(header_cell_text.strip())
    # create pandas df
    table_df = pd.DataFrame(columns=header)
    # add rows to df
    for body_row in table['bodyRows']:
        row_content = []
        for body_cell in body_row['cells']:
            body_cell_text = layout_to_text(body_cell['layout'], text)
            row_content.append(body_cell_text.strip())
        table_df.loc[len(table_df)] = row_content
    return table_df
json_path = r"test\sample_tables_rotated.json"
with open(json_path, encoding="utf-8") as f:
    out_dict = json.load(f)
tables_dfs = []
text = out_dict['document']['text']
for page in out_dict['document']['pages']:
    # pages without tables have no 'tables' key in the JSON
    for table in page.get('tables', []):
        tables_dfs.append(table_to_df(table, text))
for table_df in tables_dfs:
    print(table_df)
    print("------------------------------------------")
Output:
Number of Coils Number of Paperclips
0 5 3, 5, 4
1 10 7,8,6
2 15 11, 10, 12
3 20 15, 13, 14
------------------------------------------
Speed (mph) Driver Car Engine Date
0 407.447 Craig Breedlove Spirit of America GE J47 8/5/63
1 413.199 Tom Green Wingfoot Express WE J46 10/2/64
2 434.22 Art Arfons Green Monster GE J79 10/5/64
3 468.719 Craig Breedlove Spirit of America GE J79 10/13/64
4 526.277 Craig Breedlove Spirit of America GE J79 10/15/65
5 536.712 Art Arfons Green Monster GE J79 10/27/65
6 555.127 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/2/65
7 576.553 Art Arfons Green Monster GE J79 11/7/65
8 600.601 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/15/65
9 622.407 Gary Gabelich Blue Flame Rocket 10/23/70
10 633.468 Richard Noble Thrust 2 RR RG 146 10/4/83
11 763.035 Andy Green Thrust SSC RR Spey 10/15/97
------------------------------------------
Speed (mph) Driver Car Engine Date
0 407.447 Craig Breedlove Spirit of America GE J47 8/5/63
1 413.199 Tom Green Wingfoot Express WE J46 10/2/64
2 434.22 Art Arfons Green Monster GE J79 10/5/64
3 468.719 Craig Breedlove Spirit of America GE J79 10/13/64
4 526.277 Craig Breedlove Spirit of America GE J79 10/15/65
5 536.712 Art Arfons Green Monster GE J79 10/27/65
6 555.127 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/2/65
7 576.553 Art Arfons Green Monster GE J79 11/7/65
8 600.601 Craig Breedlove Spirit of America, Sonic 1 GE J79 11/15/65
9 622.407 Gary Gabelich Blue Flame Rocket 10/23/70
10 633.468 Richard Noble Thrust 2 RR RG 146 10/4/83
11 763.035 Andy Green Thrust SSC RR Spey 10/15/97
------------------------------------------
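To confirm that the duplication is already present in the raw response and is not introduced by the pandas conversion, the text anchors of the tables can be compared directly. The following is only a sketch: the helper names `table_text_spans` and `find_duplicate_tables` are made up for this example, and it assumes the same JSON layout used in the script above (`headerRows`, `bodyRows`, `cells`, `layout.textAnchor.textSegments`).

```python
from collections import defaultdict


def table_text_spans(table: dict) -> tuple:
    """Collect the (start, end) text offsets of every cell in a table."""
    spans = []
    rows = table.get("headerRows", []) + table.get("bodyRows", [])
    for row in rows:
        for cell in row.get("cells", []):
            anchor = cell.get("layout", {}).get("textAnchor", {})
            for segment in anchor.get("textSegments", []):
                # startIndex can be missing from the JSON when it is 0.
                start = int(segment.get("startIndex", 0))
                end = int(segment.get("endIndex", 0))
                spans.append((start, end))
    return tuple(spans)


def find_duplicate_tables(document: dict) -> list:
    """Return groups of (page_index, table_index) whose cells point at identical text offsets."""
    seen = defaultdict(list)
    for page_index, page in enumerate(document.get("pages", [])):
        for table_index, table in enumerate(page.get("tables", [])):
            spans = table_text_spans(table)
            if spans:  # ignore tables without any text anchors
                seen[spans].append((page_index, table_index))
    return [locations for locations in seen.values() if len(locations) > 1]
```

Calling `find_duplicate_tables(out_dict['document'])` on the rotated-page JSON should return a non-empty list if two tables really reference the same text offsets. An empty list would instead suggest the anchors differ and the text itself is repeated in the document.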