
I am trying to access Delta Lake tables stored on S3 from AWS Glue jobs, but I am getting the error "Module Delta not defined".

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0") \
    .getOrCreate()

from delta.tables import *

data = spark.range(0, 5)
data.write.format("delta").save("s3://databricksblaze/data")

I have also added the necessary JAR (delta-core_2.11-0.6.0.jar) to the dependent JARs of the Glue job. Can anyone help me with this? Thanks.

Vidya821

3 Answers


I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job. Here is the list of them (I am using Delta Lake 0.6.1):

  • com.ibm.icu_icu4j-58.2.jar
  • io.delta_delta-core_2.11-0.6.1.jar
  • org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
  • org.antlr_antlr4-4.7.jar
  • org.antlr_antlr4-runtime-4.7.jar
  • org.antlr_antlr-runtime-3.5.2.jar
  • org.antlr_ST4-4.0.8.jar
  • org.glassfish_javax.json-1.0.4.jar

Then in your Glue job you can use the following code:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Make the Delta Lake Python module bundled inside the jar importable
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")

from delta.tables import *

glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Write a small Delta table to S3 (note the s3a:// scheme)
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)

# Load it back as a DeltaTable to verify the write
deltaTable = DeltaTable.forPath(spark, delta_path)
EzuA
  • Hi, where did you get all these jars? The one I downloaded doesn't have org. or io. in the file name. Can you please share the download link? I am getting this error: File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o55.addFile. : java.io.FileNotFoundException: File file:/tmp/io.delta_delta-core_2.11-0.6.1.jar does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640) – Vidya821 Sep 08 '20 at 11:53
  • Hi, you can download the jars by running Spark in local mode with the following Delta version: `pyspark --packages io.delta:delta-core_2.11:0.6.1`. Then go to the ivy2 location, where you will find the jars: `/home/your_user/.ivy2/jars` – EzuA Sep 10 '20 at 13:49
  • The latest version of Spark in AWS Glue is 2.4 as of this day, which is why we should use a delta-core version below 0.7. Here is the compatibility table: https://docs.delta.io/0.8.0/releases.html – tholiv Apr 11 '21 at 13:07

You need to pass these additional configuration properties:

--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
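
For reference, a minimal sketch of setting those two properties programmatically on a SparkConf (assuming the Delta jar is already on the job's classpath, e.g. via the dependent JARs path; note that spark.jars.packages itself cannot be set this way, as the answer below explains):

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Sketch: register the Delta SQL extension and catalog. The Delta jar must
# already be on the driver/executor classpath for these classes to resolve.
conf = SparkConf()
conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
conf.set("spark.sql.catalog.spark_catalog",
         "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = SparkSession.builder.appName("MyApp").config(conf=conf).getOrCreate()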
Shubham Jain
  • Do you mean that the Spark config code will look like this: spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0","spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension","spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate()? – Vidya821 Aug 03 '20 at 13:09
  • No, actually you have to pass these configs as parameters to the Glue job. You also have to pass this parameter: `'spark.delta.logStore.class','org.apache.spark.sql.delta.storage.S3SingleDriverLogStore'` – Shubham Jain Aug 04 '20 at 05:45
  • Tried with this: conf = pyspark.SparkConf(); conf.set("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0"); conf.set("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension"); conf.set("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog"); conf.set("spark.delta.logStore.class","org.apache.spark.sql.delta.storage.S3SingleDriverLogStore"); spark = SparkSession.builder.appName("MyApp").config(conf=conf).getOrCreate() – Still getting the same error. – Vidya821 Sep 08 '20 at 12:48

Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument of the spark-submit or pyspark script. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so spark.jars.packages is a no-op at that point. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.
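
A minimal sketch of that distinction, reusing the coordinates from the question:

from pyspark.sql import SparkSession

# No-op: by the time builder.config() runs, SparkSubmit has already resolved
# and distributed the driver's dependencies, so this property is ignored.
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0") \
    .getOrCreate()

# Works: hand the coordinates to the launcher instead, e.g.
#   pyspark --packages io.delta:delta-core_2.11:0.6.0
#   spark-submit --packages io.delta:delta-core_2.11:0.6.0 your_script.py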

zsxwing
  • Hi @zsxwing, do you mean that I have to pass the argument from the Glue job parameters and use it in the script? Can you please share an example if you can? I am not sure how to do it. – Vidya821 Aug 03 '20 at 13:06
  • You can use `pyspark --packages io.delta:delta-core_2.12:0.7.0 ...` or `spark-submit --packages io.delta:delta-core_2.12:0.7.0 ...` – zsxwing Aug 03 '20 at 15:08
  • Tried with the below changes but am still getting the same error: import pyspark; import os; os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.11:0.6.0 pyspark-shell'; conf = pyspark.SparkConf(); conf.set("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0"); spark = SparkSession.builder.appName("MyApp").config(conf=conf).getOrCreate() – Vidya821 Sep 08 '20 at 12:52