
I'm trying to get PySpark to work with Delta tables.

I did "pip install delta" as well as "pip install delta-spark"

This is my `delta.py` script:

from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")

# Update every even value by adding 100 to it
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = { "id": expr("id + 100") })

# Delete every even value
deltaTable.delete(condition = expr("id % 2 == 0"))

# Upsert (merge) new data
newData = spark.range(0, 20)

deltaTable.alias("oldData") \
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()

deltaTable.toDF().show()
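
(For context: the script relies on an existing `spark` session. A minimal sketch of how one would typically be created with Delta support, assuming the `configure_spark_with_delta_pip` helper shipped with delta-spark behaves as documented, looks roughly like this; the app name is arbitrary:)

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Builder with the Delta SQL extension and catalog enabled
builder = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching Delta jars to spark.jars.packages
spark = configure_spark_with_delta_pip(builder).getOrCreate()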

Here is my spark-submit command:

spark-submit --packages io.delta:delta-core_2.12:0.7.0 --master local[*] --executor-memory 2g delta.py

Here is the output containing the error:

:: loading settings :: url = jar:file:/mnt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/eugene/.ivy2/cache
The jars for the packages stored in: /home/eugene/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c6184f38-2d95-498c-b711-ead1c4e98cdc;1.0
    confs: [default]
    found io.delta#delta-core_2.12;0.7.0 in central
    found org.antlr#antlr4;4.7 in central
    found org.antlr#antlr4-runtime;4.7 in central
    found org.antlr#antlr-runtime;3.5.2 in central
    found org.antlr#ST4;4.0.8 in central
    found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
    found org.glassfish#javax.json;1.0.4 in central
    found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 1363ms :: artifacts dl 24ms
    :: modules in use:
    com.ibm.icu#icu4j;58.2 from central in [default]
    io.delta#delta-core_2.12;0.7.0 from central in [default]
    org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
    org.antlr#ST4;4.0.8 from central in [default]
    org.antlr#antlr-runtime;3.5.2 from central in [default]
    org.antlr#antlr4;4.7 from central in [default]
    org.antlr#antlr4-runtime;4.7 from central in [default]
    org.glassfish#javax.json;1.0.4 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c6184f38-2d95-498c-b711-ead1c4e98cdc
    confs: [default]
    0 artifacts copied, 8 already retrieved (0kB/16ms)
23/01/28 14:44:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/eugene/dev/pyspark/delta.py", line 1, in <module>
    from delta.tables import *
  File "/home/eugene/dev/pyspark/delta.py", line 1, in <module>
    from delta.tables import *
ModuleNotFoundError: No module named 'delta.tables'; 'delta' is not a package
23/01/28 14:44:02 INFO ShutdownHookManager: Shutdown hook called
23/01/28 14:44:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-a8e1196f-83c5-4c81-865e-9522d4d0c056

Is there a solution for this?

  • Should be `pip` or `pip3`? Please make sure the package is installed to the proper Python version (used by Spark), in case if you have multiple versions running on the system. – Vikramsinh Shinde Jan 28 '23 at 15:16
  • I believe I am doing it the proper way. Please take a look: in the venv, `pip list` shows delta 0.4.2, delta-spark 2.2.0, importlib-metadata 6.0.0, pip 22.3.1, py4j 0.10.9.5, pyspark 3.3.1, setuptools 58.1.0, zipp 3.12.0. Yet in the Python REPL, `from delta.tables import *` still fails with `ModuleNotFoundError: No module named 'delta.tables'; 'delta' is not a package`. – Eugene Goldberg Jan 28 '23 at 15:39
  • Hope you have not missed this: https://stackoverflow.com/questions/65553722/no-module-named-delta-tables – Vikramsinh Shinde Jan 28 '23 at 15:51
  • If you are using a virtual environment then need to look around it, is it installed in `venv`? – Vikramsinh Shinde Jan 28 '23 at 16:33
  • Uninstall `delta`. Also, what is your Apache Spark version? Delta 0.7.0 is an ancient version and won't work with the latest Spark versions. – Alex Ott Jan 28 '23 at 20:28
  • Apache Spark is at version 3.3.1 – Eugene Goldberg Jan 28 '23 at 21:13

0 Answers