Trying to explore Apache Hudi for doing an incremental load, using S3 as the source and finally saving the output to a different S3 location through an AWS Glue job.
Any blogs/articles that could help as a starting point?
There is another possible way (per the answer from Robert): include custom jars in the Glue job. These will then be loaded into your Glue job and available as in any other Hadoop/Spark environment.
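For illustration, here is a minimal boto3 sketch of creating such a job with the Hudi jars attached via the `--extra-jars` job parameter. The job name, IAM role, script location, and jar paths/versions (Hudi 0.5.3 plus the matching spark-avro bundle for Glue 2.0 / Spark 2.4) are all assumptions to adapt to your setup:

```python
import boto3

glue = boto3.client("glue")

# Create a Glue 2.0 Spark job whose --extra-jars argument points at the
# custom jars in S3; Glue loads them onto the job's classpath at start-up.
# Every name, path, and jar version below is a placeholder.
glue.create_job(
    Name="hudi-incremental-demo",
    Role="my-glue-job-role",  # IAM role with access to the S3 buckets involved
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/hudi_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Comma-separated S3 paths to the jars to load into the job.
        "--extra-jars": (
            "s3://my-bucket/jars/hudi-spark-bundle_2.11-0.5.3.jar,"
            "s3://my-bucket/jars/spark-avro_2.11-2.4.4.jar"
        ),
    },
)
```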
The steps required for this approach are the following (at least they work for my PySpark jobs; please correct me if you find any information not exhaustive or you run into trouble, and I will update my answer):
Note 1: the below is for batch writes; I did not test it for Hudi streaming.
Note 2: Glue job type: Spark, Glue version: 2.0, ETL language: Python.
One last note: make sure to assign the proper IAM permissions to your Glue job.
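To tie the pieces together, here is a minimal sketch of the Glue script itself: a Hudi batch upsert into one S3 location, followed by an incremental read that lands only the changed records in another. The bucket paths, table/field names, and the begin instant time are placeholder assumptions; the option keys are the standard Hudi datasource options:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Hudi relies on Kryo serialization.
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

source_path = "s3://my-bucket/raw/orders/"       # placeholder input
hudi_table_path = "s3://my-bucket/hudi/orders/"  # placeholder Hudi table location
output_path = "s3://my-bucket/output/orders/"    # placeholder final output

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # unique key
    "hoodie.datasource.write.precombine.field": "updated_at",  # dedupe/ordering field
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Batch write: upsert the incoming records into the Hudi table on S3.
df = spark.read.json(source_path)
(df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_table_path))

# Incremental read: fetch only the records committed after the given
# instant time (a placeholder here), then write them to a different
# S3 location.
incremental_df = (spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
    .load(hudi_table_path))

incremental_df.write.mode("overwrite").parquet(output_path)

job.commit()
```

On subsequent runs, persisting the latest processed commit time somewhere (e.g. S3 or DynamoDB) and feeding it back into `hoodie.datasource.read.begin.instanttime` is what makes the load incremental rather than a full scan.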