I'd like to read, for example, GCP BigQuery tables in AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?
2 Answers
In Glue it is possible to start a Spark session like this:

from pyspark.sql import SparkSession

# Pull the BigQuery connector and its dependencies from Maven at session start
spark = SparkSession.builder \
    .appName("my-app") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.1') \
    .getOrCreate()
So, via the config() method, it is possible to pass the spark.jars.packages parameter to the Spark session and specify which Maven package to use (in this example, the connector for Google BigQuery).
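With the connector on the classpath, reading a table is then straightforward. A minimal sketch, assuming the jar is actually available to the job (see the caveat below); the project, dataset, and table names are placeholders, and credentials configuration is omitted:

# Read a BigQuery table through the spark-bigquery connector.
# "my-project.my_dataset.my_table" is a placeholder table reference.
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()
df.show(5)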
But this is not enough: it is also necessary to upload the jar package to S3 and then provide that S3 path to the Glue job as the Jar lib path / Dependent jars path.
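The same wiring can also be done programmatically when the job is defined. A minimal sketch using boto3, where the job name, role, script location, and jar path are all placeholders:

import boto3

glue = boto3.client("glue")

# "--extra-jars" points Glue at the connector jar previously uploaded to S3.
# Every name and path below is a placeholder.
glue.create_job(
    Name="bigquery-reader",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/bigquery_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.11-0.18.1.jar",
    },
    GlueVersion="2.0",
)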

Vzzarr
- Can you show what parameter in Glue must be used to pass that jar's S3 path? – marcin2x4 Nov 14 '22 at 15:26
- @marcin2x4 AFAIR at the time there was a `Jar lib path` parameter for the Glue job which would allow you to add the S3 path to the .jar file you had previously stored on S3 – Vzzarr Nov 14 '22 at 15:38
- They seem to have changed these params to `--extra-jars` and `--user-jars-first` – marcin2x4 Nov 14 '22 at 15:51
Also worth mentioning is the --user-jars-first: "true" parameter for the Glue job, which makes Glue load the user-supplied jars ahead of the ones it bundles by default.
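As a hedged sketch of how this argument could be added to an existing job via boto3 (the job name is a placeholder, carried over from the earlier example; note that fields omitted from JobUpdate are reset to their defaults):

import boto3

glue = boto3.client("glue")

# Fetch the current definition, then set "--user-jars-first" alongside the
# existing default arguments. "bigquery-reader" is a placeholder job name.
job = glue.get_job(JobName="bigquery-reader")["Job"]
args = dict(job.get("DefaultArguments", {}))
args["--user-jars-first"] = "true"

glue.update_job(
    JobName="bigquery-reader",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)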

marcin2x4
- Hi @marcin2x4, do you know how to do this in Glue notebooks, using a magic or any other method? – Shanga Aug 30 '23 at 15:52