
I have to process PDF documents through Document AI and I am trying to use Batch Processing, but it only allows me to process 50 documents per request. I have run out of ideas for how to send batches of 50 files per request when all of my files are in the same folder in the bucket.

I am trying to extract the information from scanned documents, around 800 of them.

JMxnuell

2 Answers


UPDATE 2: The product has been updated to support up to 1000 documents per batch request. (Note: Individual processors may have different page limits per request)

https://cloud.google.com/document-ai/quotas#content_limits

UPDATE: To make this an easier process, I added a feature to the Document AI Toolbox Python SDK to create batches of documents for Batch Processing.

Refer to this guide for the code sample: https://cloud.google.com/document-ai/docs/send-request#batch-documents
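
In rough outline, that Toolbox helper can be used like this (a sketch based on the linked guide; the bucket and prefix values are placeholders):

from google.cloud import documentai_toolbox

# Sketch: split everything under a Cloud Storage prefix into batches that
# fit within the Batch Processing limits.
batches = documentai_toolbox.gcs_utilities.create_batches(
    gcs_bucket_name="your-input-bucket",  # placeholder
    gcs_prefix="your-input-prefix/",      # placeholder
    batch_size=50,
)

# Each batch can then be passed as the input_documents of a BatchProcessRequest.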


Batch processing currently allows 50 documents per request, with a maximum file size of 1GB and page limits depending on which processor is being used.

https://cloud.google.com/document-ai/quotas#content_limits

You can move your files in Cloud Storage into separate directories of 50 documents each to process the whole directory at once.

You can also divide the requests up by providing specific documents for each request. Use the gcsDocuments parameter instead of gcsPrefix.

https://cloud.google.com/document-ai/docs/send-request#batch-process
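
For the whole-directory option, the batch request input uses gcsPrefix rather than a list of individual documents, roughly like this (a sketch; the bucket and folder names are placeholders):

from google.cloud import documentai

# Sketch: process every file under a Cloud Storage "folder" in one request
# (subject to the per-request document limit).
input_config = documentai.BatchDocumentsInputConfig(
    gcs_prefix=documentai.GcsPrefix(
        gcs_uri_prefix="gs://your-input-bucket/your-folder/"  # placeholder
    )
)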

You could try something similar to this

import logging
from typing import Dict, List

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.cloud import storage

# Per-request limit for Batch Processing
BATCH_MAX_FILES = 50

# MIME types accepted by Document AI processors
ACCEPTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/tiff",
    "image/gif",
}

storage_client = storage.Client()


def create_batches(
    input_bucket: str,
    input_prefix: str,
    batch_size: int = BATCH_MAX_FILES,
) -> List[List[documentai.GcsDocument]]:
    """
    Create batches of documents to process
    """
    if batch_size > BATCH_MAX_FILES:
        raise ValueError(
            f"Batch size must be no more than {BATCH_MAX_FILES}. "
            f"You provided {batch_size}"
        )

    blob_list = storage_client.list_blobs(input_bucket, prefix=input_prefix)

    batches: List[List[documentai.GcsDocument]] = []
    batch: List[documentai.GcsDocument] = []

    for blob in blob_list:

        if blob.content_type not in ACCEPTED_MIME_TYPES:
            logging.error(
                "Invalid Mime Type %s - Skipping file %s", blob.content_type, blob.name
            )
            continue

        if len(batch) == batch_size:
            batches.append(batch)
            batch = []

        batch.append(
            documentai.GcsDocument(
                gcs_uri=f"gs://{input_bucket}/{blob.name}",
                mime_type=blob.content_type,
            )
        )

    batches.append(batch)
    return batches

def batch_process_documents(
    processor: Dict,
    document_batch: List[documentai.GcsDocument],
    gcs_output_uri: str,
    skip_human_review: bool = True,
) -> documentai.BatchProcessMetadata:
    """
    Constructs requests to process documents using the Document AI
    Batch Method.
    Returns Batch Process Metadata
    """
    docai_client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{processor['location']}-documentai.googleapis.com"
        )
    )
    resource_name = docai_client.processor_path(
        processor["project_id"], processor["location"], processor["processor_id"]
    )

    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=gcs_output_uri
        )
    )

    # Provide the list of GCS documents as the batch input configuration
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(documents=document_batch)
    )
    request = documentai.BatchProcessRequest(
        name=resource_name,
        input_documents=input_config,
        document_output_config=output_config,
        skip_human_review=skip_human_review,
    )

    operation = docai_client.batch_process_documents(request)

    # The API supports limited concurrent requests.
    logging.info("Waiting for operation %s to complete...", operation.operation.name)
    # No Timeout Set
    operation.result()

    return documentai.BatchProcessMetadata(operation.metadata)

def main():
    # Placeholder configuration: replace these values with your own
    gcs_input_bucket = "your-input-bucket"
    gcs_input_prefix = "your-input-prefix/"
    gcs_output_uri = "gs://your-output-bucket/output/"
    processor = {
        "project_id": "your-project-id",
        "location": "us",  # e.g. "us" or "eu"
        "processor_id": "your-processor-id",
    }

    batches = create_batches(gcs_input_bucket, gcs_input_prefix)
    batch_process_results = []

    for i, batch in enumerate(batches):

        if len(batch) <= 0:
            continue

        logging.info("Processing batch %s: %s documents", i, len(batch))

        batch_process_metadata = batch_process_documents(
            processor=processor,
            document_batch=batch,
            gcs_output_uri=gcs_output_uri,
        )

        logging.info(batch_process_metadata.state_message)

        batch_process_results.append(batch_process_metadata)

    print(batch_process_results)


if __name__ == "__main__":
    main()
Holt Skinner
  • thanks a lot, i will try to implement it. – JMxnuell Feb 28 '23 at 01:20
  • I realized that I needed to add a "/" at the end of the path, and that is why it was picking up more folders. Now I have a problem with my Document AI processor: it extracts all the text efficiently, but if it finds a table it strangely mixes the contents of that table with the following text bodies – JMxnuell Feb 28 '23 at 01:21
  • I recommend following the guides here for information on handling the processing response (including tables). Note: use the Form Parser processor to detect tables. https://cloud.google.com/document-ai/docs/handle-response If you run into more issues, please make a new post – Holt Skinner Feb 28 '23 at 16:44

I divided the documents into batches of 50 and then processed each batch, but the problem now is that I get

InvalidArgument: 400 Request contains an invalid argument. [field_violations {
   field: "Maximum number of input documents that can be specified in a request is restricted to 50"
}]

Even though I process only 50 documents per request, for some reason the processing of some documents is repeated.

Put documents in folders:

from google.cloud import storage

bucket_name = "docstts"
client = storage.Client()
bucket = client.get_bucket(bucket_name)
blobs = bucket.list_blobs()

file_paths = []
for blob in blobs:
    file_paths.append(blob.name)

# Split the files into groups of 50
file_groups = [file_paths[x:x+50] for x in range(0, len(file_paths), 50)]

# Move the files into different folders in groups of 50
for i, file_group in enumerate(file_groups):
    folder_name = f"lote{i}"
    for file_path in file_group:
        blob = bucket.blob(file_path)
        filename = blob.name.split("/")[-1]
        # Copy into the batch folder, then delete the original
        new_blob = bucket.copy_blob(blob, bucket, f"{folder_name}/{filename}")
        blob.delete()
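
A quick sanity check along the following lines (a sketch, reusing the same client, bucket_name, and file_groups from above) can confirm that no batch folder ended up with more than 50 files, which is what the "restricted to 50" error suggests:

# Count the objects under each batch folder to verify the split
for i in range(len(file_groups)):
    prefix = f"lote{i}/"
    count = sum(1 for _ in client.list_blobs(bucket_name, prefix=prefix))
    print(f"{prefix}: {count} files")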
JMxnuell