I'd like to read, for example, GCP BigQuery tables in AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?
2 Answers
In Glue it is possible to start a Spark session like this:

from pyspark.sql import SparkSession

# Pull the BigQuery connector and its dependencies from Maven at session start
spark = SparkSession.builder \
    .appName("my-app") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1') \
    .getOrCreate()
So, via the config() method, it is possible to pass the spark.jars.packages parameter to the Spark session and specify which Maven package to use (in this example, the connector for Google BigQuery).
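With the connector on the classpath, reading a table is then straightforward. A minimal sketch, assuming the jar is actually available to the job (see the caveat below); the project, dataset, and table names are placeholders, and credentials configuration is omitted:

# Read a BigQuery table through the spark-bigquery connector.
# "my-project.my_dataset.my_table" is a placeholder table reference.
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()
df.show(5)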
But this is not enough: it is also necessary to upload the jar package to S3 and then provide that S3 path to the Glue job as the Jar lib path / Dependent jars path.
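The same wiring can also be done programmatically when the job is defined. A minimal sketch using boto3, where the job name, role, script location, and jar path are all placeholders:

import boto3

glue = boto3.client("glue")

# "--extra-jars" points Glue at the connector jar previously uploaded to S3.
# Every name and path below is a placeholder.
glue.create_job(
    Name="bigquery-reader",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/bigquery_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.11-0.18.1.jar",
    },
    GlueVersion="2.0",
)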

Vzzarr
- Can you show what parameter in Glue must be used to pass that jar's S3 path? – marcin2x4 Nov 14 '22 at 15:26
- @marcin2x4 AFAIR at the time there was a `Jar lib path` parameter for the Glue job which would allow you to add the S3 path to the .jar file you had previously stored on S3 – Vzzarr Nov 14 '22 at 15:38
- They seem to have changed these params to `--extra-jars` and `--user-jars-first` – marcin2x4 Nov 14 '22 at 15:51
Also worth mentioning is the --user-jars-first: "true" parameter for the Glue job, which makes Glue load the user-supplied jars ahead of the ones it bundles by default.
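As a hedged sketch of how this argument could be added to an existing job via boto3 (the job name is a placeholder, carried over from the earlier example; note that fields omitted from JobUpdate are reset to their defaults):

import boto3

glue = boto3.client("glue")

# Fetch the current definition, then set "--user-jars-first" alongside the
# existing default arguments. "bigquery-reader" is a placeholder job name.
job = glue.get_job(JobName="bigquery-reader")["Job"]
args = dict(job.get("DefaultArguments", {}))
args["--user-jars-first"] = "true"

glue.update_job(
    JobName="bigquery-reader",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)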

marcin2x4
- Hi @marcin2x4, do you know how to do this in Glue notebooks, using a magic or any other method? – Shanga Aug 30 '23 at 15:52