
I have some files in S3 that look like this (all under the same path):

group1_20210415.csv
group2_20210415.csv
group1_20210416.csv
group2_20210416.csv

The schema for each file is simple. The group1 files have columns:

group1_name, group1_id

and the group2 files have columns:

group2_name, group2_id

I want to be able to query these names and ids from S3 with Athena, and use AWS Glue to crawl that S3 location when new files are present.

Specifically, I want to have a table in Athena with schema:

group1_name, group1_id, group2_name, group2_id, hit_date

My intuition says to use AWS Glue PySpark to combine the data from the S3 files into a single DataFrame, which is simple enough. However, the date for each file exists only in the file name, not in the data itself.

Is there a way to extract the 'date' part of the filename and use it as a column in the AWS Glue PySpark DataFrame? If not, can anyone suggest an alternative method?

JYosen

1 Answer


You could try this:

data_frame.withColumn("input_file", input_file_name())

Note that input_file_name comes from pyspark.sql.functions. You can then transform this column to extract the date.
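As a minimal sketch, assuming the filenames always contain an 8-digit date right before the ".csv" extension (as in the question's examples), the extraction logic might look like this. The function name and the regex here are illustrative, not from any AWS API:

```python
import re

# Pattern assumed from the question's examples, e.g. "group1_20210415.csv":
# an underscore, then an 8-digit date, then the ".csv" extension.
DATE_PATTERN = r"_(\d{8})\.csv$"

def hit_date_from_path(path: str) -> str:
    """Extract the YYYYMMDD date from an S3 object key or filename."""
    match = re.search(DATE_PATTERN, path)
    if match is None:
        raise ValueError(f"no date found in {path!r}")
    return match.group(1)

# In Glue PySpark, the same pattern could be applied to the column produced
# by input_file_name(), for example:
#
#   from pyspark.sql.functions import input_file_name, regexp_extract
#
#   df = (data_frame
#         .withColumn("input_file", input_file_name())
#         .withColumn("hit_date",
#                     regexp_extract("input_file", r"_(\d{8})\.csv$", 1)))
```

The `$` anchor ties the match to the end of the path, so a stray digit sequence elsewhere in the S3 key would not be picked up by mistake.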

Robert Kossendey