0

I have 3 separate data sources (files) in 3 separate S3 buckets. The schemas of these 3 sources are different from one another, but the timestamp is the same (hourly, in epoch).

Previously, I used Glue to read from 1 bucket and apply transformations to the files in that bucket and write to a resulting bucket.

With the 3 data sources, can I still read them from the 3 different buckets, somehow join them on the epoch timestamp, and then spit out the unified data source (a combination of all 3)? I guess Glue will have to do row-level JOINs in this case.

The blog posts about Glue I have found on the web so far only talk about a single input source and transformations on it.

If this is not possible the way I am asking it, how else would you do it?

summerNight
  • 1,446
  • 3
  • 25
  • 52

1 Answer

0

I'm not quite sure what you're asking, but the Glue DynamicFrame supports a join operation, though it's limited to inner joins. The Spark DataFrame has a more robust join method that supports inner, outer, and cross joins. So you should be able to load all three S3 locations into (dynamic) data frames, join them into a single result set, and then transform and write that out.

jscott
  • 1,011
  • 8
  • 21
  • is there a blog post or a tutorial you can point me to? thanks! – summerNight Aug 18 '21 at 16:26
  • https://sparkbyexamples.com/spark/spark-join-multiple-dataframes/ looks decent and includes an example of a three-dataframe join towards the end. If you're not familiar with SQL joins, you may want to do some practice with those just so you're familiar with the concepts. – jscott Aug 19 '21 at 02:32