
I have some files in S3 that look like this (all under the same path):

group1_20210415.csv
group2_20210415.csv
group1_20210416.csv
group2_20210416.csv

The schema for each file is simple. The group1 files have columns:

group1_name, group1_id

and the group2 files have columns:

group2_name, group2_id

I want to be able to query these names and ids from S3 with Athena, and use AWS Glue to crawl that S3 location when new files are present.

Specifically, I want to have a table in Athena with schema:

group1_name, group1_id, group2_name, group2_id, hit_date

My intuition says to use AWS Glue PySpark to combine the data from the S3 files into a single DataFrame, which is simple enough. However, the date for each file exists only in the file name, not in the data itself.

Is there a way to extract the 'date' part of the filename and use it as a column in the AWS Glue PySpark DataFrame? If not, can anyone suggest an alternative method?

JYosen

1 Answer


You could try this:

data_frame.withColumn("input_file", input_file_name())

Note that input_file_name comes from pyspark.sql.functions. You can then transform this column to extract the date.
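As a minimal sketch, assuming the filenames always contain an 8-digit date right before the ".csv" extension (as in the question's examples), the extraction logic might look like this. The function name and the regex here are illustrative, not from any AWS API:

```python
import re

# Pattern assumed from the question's examples, e.g. "group1_20210415.csv":
# an underscore, then an 8-digit date, then the ".csv" extension.
DATE_PATTERN = r"_(\d{8})\.csv$"

def hit_date_from_path(path: str) -> str:
    """Extract the YYYYMMDD date from an S3 object key or filename."""
    match = re.search(DATE_PATTERN, path)
    if match is None:
        raise ValueError(f"no date found in {path!r}")
    return match.group(1)

# In Glue PySpark, the same pattern could be applied to the column produced
# by input_file_name(), for example:
#
#   from pyspark.sql.functions import input_file_name, regexp_extract
#
#   df = (data_frame
#         .withColumn("input_file", input_file_name())
#         .withColumn("hit_date",
#                     regexp_extract("input_file", r"_(\d{8})\.csv$", 1)))
```

The `$` anchor ties the match to the end of the path, so a stray digit sequence elsewhere in the S3 key would not be picked up by mistake.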

Robert Kossendey