
I'm trying to run simple code in a Dataproc Jupyter notebook to write data to a Delta table. The code works fine in the Python notebook, but the same code fails on the Delta write when run in the PySpark notebook.

Upon further debugging, my guess is that the issue is caused by the Delta jars not being loaded during Spark session creation (see the log messages below).

Questions:

  • How can we verify whether the jars are actually loaded in the session? (See the sketch after this list.)
  • How can we add the jars if they are not being picked up by the PySpark approach below?
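
For reference, a minimal sketch of how the loaded jars can be inspected from an active session. Note that listJars() is reached through the private _jsc handle, so this is a debugging aid rather than a stable API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Packages requested via config; returns the fallback if the key was never set.
conf = spark.sparkContext.getConf()
print(conf.get("spark.jars.packages", "<not set>"))
print(conf.get("spark.sql.extensions", "<not set>"))

# Jars actually registered with the underlying JVM SparkContext.
print(spark.sparkContext._jsc.sc().listJars().toString())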

In the Python notebook:

from pyspark.sql import SparkSession

# spark.stop()

# Note: repeated .config() calls on the same key overwrite each other,
# so both packages must go in a single comma-separated value.
spark = SparkSession.builder \
    .appName('test_session_3') \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0,io.delta:delta-storage:2.3.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Log output: mentions the jars being added to the distributed cache.

23/08/18 11:06:00 INFO SparkEnv: Registering MapOutputTracker
23/08/18 11:06:00 INFO SparkEnv: Registering BlockManagerMaster
23/08/18 11:06:00 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/08/18 11:06:00 INFO SparkEnv: Registering OutputCommitCoordinator
23/08/18 11:06:00 WARN Client: Same path resource file:///root/.ivy2/jars/io.delta_delta-core_2.12-2.3.0.jar added multiple times to distributed cache.
23/08/18 11:06:00 WARN Client: Same path resource file:///root/.ivy2/jars/io.delta_delta-storage-2.3.0.jar added multiple times to distributed cache.
23/08/18 11:06:00 WARN Client: Same path resource file:///root/.ivy2/jars/org.antlr_antlr4-runtime-4.8.jar added multiple times to distributed cache.
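
For completeness, the write itself is a plain Delta write along the lines of the sketch below (the output path is a placeholder, not the real bucket):

# If the Delta jars resolved correctly, this write succeeds in the Python notebook.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("gs://<your-bucket>/tmp/delta_smoke_test")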

In the PySpark notebook:

from pyspark.sql import SparkSession

# Same fix as above: one comma-separated value, since a second .config()
# call on the same key would overwrite the first.
spark = SparkSession.builder \
    .appName('test_session_21') \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0,io.delta:delta-storage:2.3.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Log output: no mention of jars being added to the cache.

23/08/18 11:04:43 INFO SparkEnv: Registering MapOutputTracker
23/08/18 11:04:43 INFO SparkEnv: Registering BlockManagerMaster
23/08/18 11:04:43 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/08/18 11:04:43 INFO SparkEnv: Registering OutputCommitCoordinator
