I am training a GCP Document AI custom processor to extract data from PDF patent forms. One line in particular is troublesome. On the forms, the Application No./Patent No. is presented as follows: 19165768.3 - 1216 / 3557377 (see attached highlighted screenshot). From this line I would like to extract the Application No., which is the decimal number (float) before the dash (in the example: 19165768.3), and the Patent No., which is the integer after the forward slash (in the example: 3557377). The problem is that the predicted Application No. often includes the dash and sometimes even the four digits after it (e.g. 19165768.3 - or 19165768.3 - 1216). The Patent No. is even worse: the prediction almost always includes the four digits, the forward slash, and the patent number (e.g. 1216/3557377).
I tried a number of approaches:
- increased the number of training documents
- when labeling the training documents I used the 'Select Text Tool' to try to select only the text that I want for each field. The problem is that it often also highlights the unwanted dash and/or forward slash
- when labeling the training documents I then used the 'Bounding Box' tool to highlight only the Patent No., but that usually (9 out of 10 times) still highlighted the four digits, the forward slash, and the patent no.
- lastly, I tried to manually delete the four digits and the forward slash from the labels themselves (e.g. the Bounding Box tool selected 1216/3557377 as the label; I manually edited the label value to only be 3557377, the correct patent no.). But this only reduced the F1 score for that label to 0.235, because the model still usually predicted 1216/3557377 and the evaluation counted each such prediction as a False Negative/Positive.
I am aware that I can build custom logic on the backend before recording the data into our database to eliminate the dash and/or the forward slash. But I still want to know if there is a way to train the custom model to recognize this data correctly.
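For reference, the backend fallback I have in mind would look roughly like the sketch below. It is only an illustration: the function names and regular expressions are my own assumptions, not part of Document AI, and it assumes the Application No. always starts the extracted text in the form digits.digits and the Patent No. is always the last run of digits.

```python
import re
from typing import Optional

# Hypothetical helpers: they clean the raw text of each extracted entity.
# Assumption: the Application No. looks like "19165768.3" (digits, dot, digits)
# and appears at the start of the extracted text.
APPLICATION_RE = re.compile(r"\s*(\d+\.\d+)")

# Assumption: the Patent No. is the final run of digits in the extracted text.
PATENT_RE = re.compile(r"(\d+)\s*$")


def clean_application_no(raw: str) -> Optional[str]:
    """Return the leading decimal number, dropping any trailing ' - 1216'."""
    match = APPLICATION_RE.match(raw)
    return match.group(1) if match else None


def clean_patent_no(raw: str) -> Optional[str]:
    """Return the trailing integer, dropping any leading '1216/'."""
    match = PATENT_RE.search(raw)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(clean_application_no("19165768.3 - 1216"))
    print(clean_patent_no("1216/3557377"))
```

Both helpers return None when the text does not match the assumed shape, so unexpected values can be flagged for review instead of being stored silently.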