I'm trying to get PySpark working with Delta tables.
I ran "pip install delta" as well as "pip install delta-spark".
This is my delta.py script:
from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Update every even value by adding 100 to it
deltaTable.update(
    condition = expr("id % 2 == 0"),
    set = { "id": expr("id + 100") })

# Delete every even value
deltaTable.delete(condition = expr("id % 2 == 0"))

# Upsert (merge) new data
newData = spark.range(0, 20)
deltaTable.alias("oldData") \
    .merge(
        newData.alias("newData"),
        "oldData.id = newData.id") \
    .whenMatchedUpdate(set = { "id": col("newData.id") }) \
    .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
    .execute()

deltaTable.toDF().show()
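The script relies on a spark session already being available; a minimal sketch of creating one with Delta support (the app name here is arbitrary, and the two configs are the standard Delta Lake extension and catalog settings) would look roughly like this:

from pyspark.sql import SparkSession

# Minimal SparkSession with the standard Delta Lake extension and catalog settings.
spark = SparkSession.builder \
    .appName("delta-demo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()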
Here is my spark-submit command:
spark-submit --packages io.delta:delta-core_2.12:0.7.0 --master local[*] --executor-memory 2g delta.py
Here is the output containing the error:
:: loading settings :: url = jar:file:/mnt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/eugene/.ivy2/cache
The jars for the packages stored in: /home/eugene/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c6184f38-2d95-498c-b711-ead1c4e98cdc;1.0
confs: [default]
found io.delta#delta-core_2.12;0.7.0 in central
found org.antlr#antlr4;4.7 in central
found org.antlr#antlr4-runtime;4.7 in central
found org.antlr#antlr-runtime;3.5.2 in central
found org.antlr#ST4;4.0.8 in central
found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
found org.glassfish#javax.json;1.0.4 in central
found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 1363ms :: artifacts dl 24ms
:: modules in use:
com.ibm.icu#icu4j;58.2 from central in [default]
io.delta#delta-core_2.12;0.7.0 from central in [default]
org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
org.antlr#ST4;4.0.8 from central in [default]
org.antlr#antlr-runtime;3.5.2 from central in [default]
org.antlr#antlr4;4.7 from central in [default]
org.antlr#antlr4-runtime;4.7 from central in [default]
org.glassfish#javax.json;1.0.4 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 8 | 0 | 0 | 0 || 8 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c6184f38-2d95-498c-b711-ead1c4e98cdc
confs: [default]
0 artifacts copied, 8 already retrieved (0kB/16ms)
23/01/28 14:44:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/home/eugene/dev/pyspark/delta.py", line 1, in <module>
from delta.tables import *
File "/home/eugene/dev/pyspark/delta.py", line 1, in <module>
from delta.tables import *
ModuleNotFoundError: No module named 'delta.tables'; 'delta' is not a package
23/01/28 14:44:02 INFO ShutdownHookManager: Shutdown hook called
23/01/28 14:44:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-a8e1196f-83c5-4c81-865e-9522d4d0c056
Is there a solution for this?