I am attempting to create a Data Asset with the Great Expectations (GX) library that points to all the files in subfolders under a parent folder. Here is a sample code snippet:

asset_name = "iceberg_asset" 
s3_prefix = "folder_a/folder_b/folder_c/" 
batching_regex = r"subfolder_a\/file\.parquet" 

data_asset = datasource.add_parquet_asset(name=asset_name, batching_regex=batching_regex, s3_prefix=s3_prefix)

The batching_regex is supposed to capture all files with a specific full path, which includes the parent folder and file name. However, the code does not work and fails with a "file not found" error. I have confirmed that the regex itself is correct.

Currently, only a regex that matches files directly under the s3_prefix works. Does anyone have suggestions for making this work for folders and files that match the regex?

1 Answer

TL;DR: Add the s3_recursive_file_discovery=True parameter to the asset definition.

Longer version

If I understood correctly, you would like GX to recursively find all of the files in the different sub-folders.

This is done by setting the s3_recursive_file_discovery parameter to True. This tells GX to recursively find files in all of the sub-folders inside the parent folder.

This would look like:

asset_name = "iceberg_asset" 
s3_prefix = "folder_a/folder_b/folder_c/" 
batching_regex = r"subfolder_a\/file\.parquet" 

data_asset = datasource.add_parquet_asset(
    name=asset_name,
    batching_regex=batching_regex,
    s3_prefix=s3_prefix,
    s3_recursive_file_discovery=True,
)

This makes GX recursively search all of the sub-folders inside the s3_prefix folder. GX does not limit the listing to the regex you provided: it first finds all of the paths in that folder and then drops the paths that don't match the regex pattern.
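The list-then-filter behavior described above can be sketched in plain Python. This is not GX's own code, only an illustration: the filter_paths helper and the listed keys are hypothetical, and matching the regex anywhere in the full key with re.search is an assumption made for the example.

```python
import re


def filter_paths(all_keys, batching_regex):
    """Keep only the keys that the regex matches somewhere in the path."""
    pattern = re.compile(batching_regex)
    return [key for key in all_keys if pattern.search(key)]


# Hypothetical keys, as a recursive listing under the s3_prefix might return them.
listed_keys = [
    "folder_a/folder_b/folder_c/subfolder_a/file.parquet",
    "folder_a/folder_b/folder_c/subfolder_b/file.parquet",
    "folder_a/folder_b/folder_c/readme.txt",
]

# Every key is listed first; the regex is applied only afterwards.
matched = filter_paths(listed_keys, r"subfolder_a\/file\.parquet")
print(matched)
```

The point of the sketch is that the listing cost depends on how many objects sit under the prefix, not on how selective the regex is.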

NOTE! Make sure to evaluate the impact this has on your costs and resources. GX will download all of the matching files from the sub-folders when, for example, running a Data Assistant.

Also, note that GX currently (version 0.17.5) validates only the last file of a data asset. GX will use all of the files when, for example, running the Data Assistant.

And finally, by default GX will find 1000 paths. You can increase or decrease this limit by adding the s3_max_keys parameter to the Data Asset.
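As a rough sketch, the path limit can be passed alongside the other asset parameters. The helper name add_recursive_parquet_asset is hypothetical and exists only to keep the example self-contained; it assumes datasource is the same S3 datasource object used in the question.

```python
def add_recursive_parquet_asset(datasource, max_keys):
    """Register the asset with recursive discovery and a custom path limit."""
    return datasource.add_parquet_asset(
        name="iceberg_asset",
        batching_regex=r"subfolder_a\/file\.parquet",
        s3_prefix="folder_a/folder_b/folder_c/",
        s3_recursive_file_discovery=True,
        # Raise or lower the default limit of 1000 paths.
        s3_max_keys=max_keys,
    )
```

It would be called as data_asset = add_recursive_parquet_asset(datasource, max_keys=5000), where 5000 is an arbitrary value picked for illustration.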

Toivo Mattila