0

I have 3 separate data sources (files) in 3 separate S3 buckets. The schemas of these 3 sources are different from one another, but the timestamp is the same (hourly, in epoch).

Previously, I used Glue to read from 1 bucket and apply transformations to the files in that bucket and write to a resulting bucket.

With the 3 data sources, can I still read them from the 3 different buckets, somehow join them on the epoch timestamp, and then spit out the unified data source (a combination of all 3)? I guess Glue will have to do row-level JOINs in this case.

The blog posts about Glue I have found on the web so far only talk about a single input source and transformations on it.

If this is not possible the way I am asking it, how else would you do it?

summerNight
  • 1,446
  • 3
  • 25
  • 52

1 Answer

0

I'm not quite sure what you're asking, but the Glue DynamicFrame supports a join operation, though it's limited to inner joins. The Spark DataFrame has a more robust join method that supports inner, outer, and cross joins. So you should be able to load all three S3 locations into (dynamic) data frames, join them into a single result set, and then transform and write that out.

jscott
  • 1,011
  • 8
  • 21
  • is there a blog post or a tutorial you can point me to? thanks! – summerNight Aug 18 '21 at 16:26
  • https://sparkbyexamples.com/spark/spark-join-multiple-dataframes/ looks decent and includes an example of a three-dataframe join towards the end. If you're not familiar with SQL joins, you may want to do some practice with those just so you're familiar with the concepts. – jscott Aug 19 '21 at 02:32