Trying to explore Apache Hudi for doing an incremental load, using S3 as the source and finally saving the output to a different S3 location through an AWS Glue job.
Any blogs/articles that could help as a starting point?
There is another possible way (per the answer from Robert): include custom jars in the Glue job. These will then be loaded into your Glue job and available as in any other Hadoop/Spark environment.
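For illustration, here is a minimal boto3 sketch of creating such a job with the Hudi jars attached via the `--extra-jars` job parameter. The job name, IAM role, script location, and jar paths/versions (Hudi 0.5.3 plus the matching spark-avro bundle for Glue 2.0 / Spark 2.4) are all assumptions to adapt to your setup:

```python
import boto3

glue = boto3.client("glue")

# Create a Glue 2.0 Spark job whose --extra-jars argument points at the
# custom jars in S3; Glue loads them onto the job's classpath at start-up.
# Every name, path, and jar version below is a placeholder.
glue.create_job(
    Name="hudi-incremental-demo",
    Role="my-glue-job-role",  # IAM role with access to the S3 buckets involved
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/hudi_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Comma-separated S3 paths to the jars to load into the job.
        "--extra-jars": (
            "s3://my-bucket/jars/hudi-spark-bundle_2.11-0.5.3.jar,"
            "s3://my-bucket/jars/spark-avro_2.11-2.4.4.jar"
        ),
    },
)
```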
The steps required for this approach are the following (at least they work for my PySpark jobs; please correct me if you find any information not exhaustive or you run into trouble, and I will update my answer):
Note 1: the below is for batch writes; I did not test it for Hudi streaming.
Note 2: Glue job type: Spark, Glue version: 2.0, ETL language: Python.
One last note: make sure to assign the proper IAM permissions to your Glue job.
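To tie the pieces together, here is a minimal sketch of the Glue script itself: a Hudi batch upsert into one S3 location, followed by an incremental read that lands only the changed records in another. The bucket paths, table/field names, and the begin instant time are placeholder assumptions; the option keys are the standard Hudi datasource options:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Hudi relies on Kryo serialization.
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

source_path = "s3://my-bucket/raw/orders/"       # placeholder input
hudi_table_path = "s3://my-bucket/hudi/orders/"  # placeholder Hudi table location
output_path = "s3://my-bucket/output/orders/"    # placeholder final output

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # unique key
    "hoodie.datasource.write.precombine.field": "updated_at",  # dedupe/ordering field
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Batch write: upsert the incoming records into the Hudi table on S3.
df = spark.read.json(source_path)
(df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_table_path))

# Incremental read: fetch only the records committed after the given
# instant time (a placeholder here), then write them to a different
# S3 location.
incremental_df = (spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
    .load(hudi_table_path))

incremental_df.write.mode("overwrite").parquet(output_path)

job.commit()
```

On subsequent runs, persisting the latest processed commit time somewhere (e.g. S3 or DynamoDB) and feeding it back into `hoodie.datasource.read.begin.instanttime` is what makes the load incremental rather than a full scan.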