I have some files in S3 that look like this (all in the same path):
group1_20210415.csv
group2_20210415.csv
group1_20210416.csv
group2_20210416.csv
The schema for each file is rather simple. The group1 files contain:
group1_name, group1_id
and the group2 files contain:
group2_name, group2_id
I want to be able to query these names and IDs from S3 with Athena, and use an AWS Glue crawler on that S3 location to pick up new files as they arrive.
Specifically, I want to have a table in Athena with schema:
group1_name, group1_id, group2_name, group2_id, hit_date
My intuition says to use AWS Glue with PySpark to combine the data from the S3 files into a single DataFrame, which is simple enough. However, the date for each file exists only in the file name itself, not in the data.
Is there a way to extract the date part of the filename and use it as a column in the PySpark DataFrame within the Glue job? If not, can anyone suggest an alternative approach?
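For concreteness, this is roughly the kind of thing I'm imagining, shown only for the group1 files and using plain spark.read rather than Glue DynamicFrames. The bucket and prefix are placeholders, and I haven't verified this pattern end to end, though I believe input_file_name() and regexp_extract() exist in pyspark.sql.functions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name, regexp_extract, to_date

    spark = SparkSession.builder.getOrCreate()

    # "my-bucket/my-prefix" is a placeholder for wherever the CSVs actually live.
    group1_df = (
        spark.read.csv("s3://my-bucket/my-prefix/group1_*.csv")
        .toDF("group1_name", "group1_id")
        # input_file_name() returns the full path of the source file for each row;
        # pull the 8-digit date out of the filename and cast it to a proper date.
        .withColumn(
            "hit_date",
            to_date(
                regexp_extract(input_file_name(), r"group1_(\d{8})\.csv$", 1),
                "yyyyMMdd",
            ),
        )
    )

    group1_df.show()

If something along these lines is viable, I'd do the same for the group2 files and then combine the two DataFrames into the final schema above.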