
In MWAA, I am using the following code to access the files in my S3 bucket. The S3 bucket is of the following form:

aws s3 ls s3://example-bucket/incoming/driver-events/ingestDate=2021-05-26/

The above command works fine. Now I am attempting to get the same information from an S3_hook.S3Hook() call from Airflow. I have the following code:

bucket = 's3://example-bucket/incoming/driver-events/ingestDate=2021-05-26/'
s3_handle = S3_hook.S3Hook(aws_conn_id='s3_default')
key_list = s3_handle.list_keys(bucket_name=bucket)
print(f"{len(key_list)} keys found in bucket")
for key in key_list:
    logging.info(key)

This is resulting in an error from boto3:

botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "s3://example-bucket/incoming/driver-events/ingestDate=2021-05-26/": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

I understand that the error occurs because boto3 performs parameter validation on the bucket name, and the regular expression it uses is too restrictive.

How do I handle this case in Airflow? Is there any way to disable the parameter validation? I can see that 'parameter_validation' can be set to False in boto3 through a configuration setting, but how do I do that with an S3Hook() in Airflow that is already set up in its default way and cannot accept a boto3 configuration? Making it more complicated, I am on MWAA, which does not give you any control over the ~/.boto/ folder.

AnupamB

1 Answer


The problem is that the bucket name should be only example-bucket; the rest of the path is the prefix passed to the function call that lists objects. S3 does not store data as folders and files: every object is a key-value pair, and the key is the full path after the bucket name.

The code block that works is as follows:

import boto3
from airflow.models import Variable

bucket_prefix = 'incoming/driver-events/ingestDate=2021-05-26/'
client = boto3.client(
    's3',
    aws_access_key_id=Variable.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=Variable.get("AWS_SECRET_ACCESS_KEY"),
)
response = client.list_objects_v2(
    Bucket='example-bucket',
    Delimiter='/',
    Prefix=bucket_prefix,
    MaxKeys=1000,
)
print(response)
contents = response["Contents"]               # These are the files
common_prefixes = response["CommonPrefixes"]  # These are the "folders"
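The same fix also applies to the S3Hook approach from the question: pass only the bare bucket name as bucket_name and the rest of the path as prefix. A minimal sketch of splitting the URL with the standard library (the commented list_keys() call assumes the 's3_default' connection from the question):

```python
from urllib.parse import urlparse

# Split the s3:// URL into the bare bucket name and the key prefix,
# which is what list_keys() expects as separate arguments.
url = 's3://example-bucket/incoming/driver-events/ingestDate=2021-05-26/'
parsed = urlparse(url)
bucket = parsed.netloc            # 'example-bucket'
prefix = parsed.path.lstrip('/')  # 'incoming/driver-events/ingestDate=2021-05-26/'

# With the hook from the question:
# s3_handle = S3_hook.S3Hook(aws_conn_id='s3_default')
# key_list = s3_handle.list_keys(bucket_name=bucket, prefix=prefix)
```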

AnupamB