0

I have a folder struture in S3 which looks like this.

root/
├── parter-1/
|   ├── config/
|   │   ├── config.json
|   │   └── feature.json
|   ├── customer-1
|   |   ├── config/
|   |   │   ├── config.json
|   |   │   └── feature.json
|   |   └── data/
|   |       ├── model-1/
|   |       │   ├── input/ 
|   |       |   |   ├── current/
|   |       |   |   |   ├── tbl1.csv
|   |       |   |   |   └── tbl2.csv
|   |       |   |   └── archive/
|   |       |   |   |   ├── aod=20211012/
|   |       |   |   |   ├── tbl1.csv
|   |       |   |   └── tbl2.csv
|   |       |   |       └── aod=20211210/
|   |       |   |           ├── tbl1.csv
|   |       |   |           └── tbl2.csv
|   |       │   └── output/
|   |       |       └──(Same as input)
|   |       ├── model-2/
|   |       │   └── (Same as model-1)
|   |       └── input.zip
|   ├── customer-2
|   .   └── (Same as customer-1)
|   .
|   | 
.   └── customer-n
.       └── (Same as customer-1)
└── partner-n
    └── (Same as partner-1)       

Now, I need to generate athena tables in AWS (for tb1, tbl2) and so on. all files with the sanme tbl1.csv have the same schema (columns) and same is true for tbl2 and so on. I need to completely ignore the config folder, zip file & any json files that are present in the directory.

the final output table needs to be someting like this.

**tbl1**
col_1 | col_2 | col_3 | partner |  customer | model | 



Mohan
  • 4,677
  • 7
  • 42
  • 65

3 Answers3

0

That is not possible.

Amazon Athena will automatically include all files in the location given, including files in all subdirectories.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
0

Please check the following links to create the tables

Selecting specific files for athena

Partition Athena query by S3 created date

Create a view to expose partner, customer and model from tbl1 as select <all_columns>, regexp_extract("$path", 'root/(.*)/') partner, regexp_extract(....) customer, regexp_extract(....) model from tbl1

0

You can achieve it with AWS Glue crawler.

Take a look how you can set up the crawler on AWS Glue below: https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html

In your case, simply add s3://{YOUR_BUCKET}/root to include all sub folders.

After the initial run of the crawler, you should see multiple tables sometime with one record. You can combine them by navigating to the Configure the crawler's Set output and Scheduling section, select Create a single schema for each S3 path (you will see in Advanced options)

enter image description here

Here is the reference: https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-multiple-tables/

Hiro
  • 353
  • 5
  • 12