0

I am trying to run the demo code given in PDF parsing of GCP document AI. To run the code, exporting Google credentials as a command line works fine. The problem comes when the code needs to run in memory and hence no credential files are allowed to be accessed from disk. Is there a way to pass the credentials in the document ai parsing function?

The sample code of Google:

def main(project_id='YOUR_PROJECT_ID',
         input_uri='gs://cloud-samples-data/documentai/invoice.pdf'):
    """Process a single document with the Document AI API, including
    text extraction and entity extraction."""

    client = documentai.DocumentUnderstandingServiceClient()
    
    gcs_source = documentai.types.GcsSource(uri=input_uri)

    # mime_type can be application/pdf, image/tiff,
    # and image/gif, or application/json
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    # Location can be 'us' or 'eu'
    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config)

    document = client.process_document(request=request)

    # All text extracted from the document
    print('Document Text: {}'.format(document.text))

    def _get_text(el):
        """Convert text offset indexes into text snippets.
        """
        response = ''
        # If a text segment spans several lines, it will
        # be stored in different text segments.
        for segment in el.text_anchor.text_segments:
            start_index = segment.start_index
            end_index = segment.end_index
            response += document.text[start_index:end_index]
        return response

    for entity in document.entities:
        print('Entity type: {}'.format(entity.type))
        print('Text: {}'.format(_get_text(entity)))
        print('Mention text: {}\n'.format(entity.mention_text))
double-beep
  • 5,031
  • 17
  • 33
  • 41
sentinel
  • 1
  • 1

1 Answers1

0

When you run your workloads on GCP, you don't need to have a service account key file. You MUSTN'T!!

Why? 2 reasons:

  1. It's useless because all GCP products have, at least, a default service account. And most of time, you can customize it. You can have a look on Cloud Function identity in your case.
  2. Service account key file is a file. It means a lot: you can copy it, send it by email, commit it in Git repository... many persons can have access to it and you loose the management of this secret. And because it's a secret, you have to store it securely, you have to rotate it regularly (at least every 90 days, Google recommendation),... It's nighmare! When you can, don't use service account key file!

What the libraries are doing?

  • There are looking if GOOGLE_APPLICATION_CREDENTIALS env var exists.
  • There are looking into the "well know" location (when you perform a gcloud auth application-default login to allow the local application to use your credential to access to Google Resources, a file is created in a "standard location" on your computer)
  • If not, check if the metadata server exists (only on GCP). This server provides the authentication information to the libraries.
  • else raise an error.

So, simply use the correct service account in your function and provide it the correct role to achieve what you want to do.

guillaume blaquiere
  • 66,369
  • 2
  • 47
  • 76