Generate Athena Tables for dynamic Partioned Folder structures on S3

Question

I have a folder struture in S3 which looks like this.

root/
├── parter-1/
|   ├── config/
|   │   ├── config.json
|   │   └── feature.json
|   ├── customer-1
|   |   ├── config/
|   |   │   ├── config.json
|   |   │   └── feature.json
|   |   └── data/
|   |       ├── model-1/
|   |       │   ├── input/ 
|   |       |   |   ├── current/
|   |       |   |   |   ├── tbl1.csv
|   |       |   |   |   └── tbl2.csv
|   |       |   |   └── archive/
|   |       |   |   |   ├── aod=20211012/
|   |       |   |   |   ├── tbl1.csv
|   |       |   |   └── tbl2.csv
|   |       |   |       └── aod=20211210/
|   |       |   |           ├── tbl1.csv
|   |       |   |           └── tbl2.csv
|   |       │   └── output/
|   |       |       └──(Same as input)
|   |       ├── model-2/
|   |       │   └── (Same as model-1)
|   |       └── input.zip
|   ├── customer-2
|   .   └── (Same as customer-1)
|   .
|   | 
.   └── customer-n
.       └── (Same as customer-1)
└── partner-n
    └── (Same as partner-1)

Now, I need to generate athena tables in AWS (for tb1, tbl2) and so on. all files with the sanme tbl1.csv have the same schema (columns) and same is true for tbl2 and so on. I need to completely ignore the config folder, zip file & any json files that are present in the directory.

the final output table needs to be someting like this.

**tbl1**
col_1 | col_2 | col_3 | partner |  customer | model |

score 0 · Answer 1 · answered Oct 26 '21 at 10:01

0

That is not possible.

Amazon Athena will automatically include all files in the location given, including files in all subdirectories.

answered Oct 26 '21 at 10:01

John Rotenstein

241,921
22
380
470

score 0 · Answer 2 · answered Nov 03 '21 at 22:32

Please check the following links to create the tables

Selecting specific files for athena

Partition Athena query by S3 created date

Create a view to expose partner, customer and model from tbl1 as select <all_columns>, regexp_extract("$path", 'root/(.*)/') partner, regexp_extract(....) customer, regexp_extract(....) model from tbl1

score 0 · Answer 3 · answered Feb 16 '23 at 20:24

You can achieve it with AWS Glue crawler.

Take a look how you can set up the crawler on AWS Glue below: https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html

In your case, simply add s3://{YOUR_BUCKET}/root to include all sub folders.

After the initial run of the crawler, you should see multiple tables sometime with one record. You can combine them by navigating to the Configure the crawler's Set output and Scheduling section, select Create a single schema for each S3 path (you will see in Advanced options)

Here is the reference: https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-multiple-tables/

Generate Athena Tables for dynamic Partioned Folder structures on S3

3 Answers3