I am using Python to batch-process PDFs through GCP Document AI ("DocAI"). The PDFs have long file names such as 71.169892_01-2022.10.15-21275188-1111.pdf. Often the only difference between two filenames is the last four digits before .pdf (for example, 71.169892_01-2022.10.15-21275188-1111.pdf and 71.169892_01-2022.10.15-21275188-2547.pdf).
When such a PDF is processed through DocAI, it outputs one or more JSON files with a shortened filename such as 71.169892_01-2022.10-0.json, 71.169892_01-2022.10-1.json, and so on. How can I ensure that DocAI does not cut off the filename? Is there an attribute I can add to the BatchProcessing request to ensure that the output preserves the full filename?
This is important because when I process two PDFs with nearly identical filenames (e.g. 71.169892_01-2022.10.15-21275188-1111.pdf and 71.169892_01-2022.10.15-21275188-2547.pdf), the resulting JSONs end up with the same filename: 71.169892_01-2022.10-0.json. This becomes a problem when those JSONs are moved out of the folders where DocAI automatically stores them and into a single shared folder: the second JSON simply overwrites the first, because it has the same name.
The current state is as follows:
Input PDF: 71.169892_01-2022.10.15-21275188-1111.pdf
Output JSON: 71.169892_01-2022.10-0.json
Expected:
Input PDF: 71.169892_01-2022.10.15-21275188-1111.pdf
Output JSON: 71.169892_01-2022.10.15-21275188-1111.json
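In case no such attribute exists, the fallback I am considering is to rename each JSON during the move step instead. Below is a minimal sketch of that idea, not a confirmed DocAI feature. It assumes I can pair every output JSON with the URI of the PDF it came from (for example from the batch operation's metadata, which I have not verified). The helper name full_output_name and the bucket paths are hypothetical. It also keeps the shard index that DocAI appends, so the result would be ...-1111-0.json rather than exactly the name shown above, since one PDF can produce several JSONs.

```python
import posixpath
import re


def full_output_name(input_pdf_uri: str, shard_json_uri: str) -> str:
    """Return a JSON filename that keeps the full PDF stem plus the shard index.

    Hypothetical helper: both arguments are plain path or URI strings, and the
    pairing of an output JSON with its input PDF is assumed to be known already.
    """
    # Full stem of the original PDF, e.g. "71.169892_01-2022.10.15-21275188-1111".
    stem = posixpath.splitext(posixpath.basename(input_pdf_uri))[0]

    # Shard index taken from the shortened output name, e.g. "...-0.json" -> "0".
    shard_name = posixpath.basename(shard_json_uri)
    match = re.search(r"-(\d+)\.json$", shard_name)
    shard = match.group(1) if match else "0"  # assume shard 0 if no index is found

    return f"{stem}-{shard}.json"


# Hypothetical bucket layout, for illustration only.
name_a = full_output_name(
    "gs://my-bucket/in/71.169892_01-2022.10.15-21275188-1111.pdf",
    "gs://my-bucket/out/op/0/71.169892_01-2022.10-0.json",
)
name_b = full_output_name(
    "gs://my-bucket/in/71.169892_01-2022.10.15-21275188-2547.pdf",
    "gs://my-bucket/out/op/1/71.169892_01-2022.10-0.json",
)
print(name_a)
print(name_b)
```

With names built this way, the two JSONs above would no longer collide when copied into the same folder, but I would still prefer a setting in the request itself if one exists.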