
When using com.crealytics:spark-excel_2.12:0.14.0 without delta:

spark = SparkSession.builder.appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .getOrCreate()

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load(path2)

This works, and I can read Excel files fine. But creating a session with configure_spark_with_delta_pip:

builder = SparkSession.builder.appName("transaction") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

Gives me the following error:

Py4JJavaError: An error occurred while calling o139.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.crealytics.spark.excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: com.crealytics.spark.excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
	... 14 more

Why? And how can I avoid this?

OMA

1 Answer


You are getting this error because configure_spark_with_delta_pip overwrites your spark.jars.packages config property with the Delta Lake package it needs to import. As a result, your package com.crealytics:spark-excel_2.12:0.14.0 is no longer requested, and the Excel data source is not available on the classpath. See this snippet from the source code here:

    scala_version = "2.12"
    maven_artifact = f"io.delta:delta-core_{scala_version}:{delta_version}"

    return spark_session_builder.config("spark.jars.packages", maven_artifact) 

Unfortunately, at this time the Builder does not let us retrieve existing config properties or the underlying SparkConf object, so we cannot adjust these properties dynamically before calling getOrCreate to create our Spark session.
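
You can, however, confirm the overwrite on the session that does get created. A minimal diagnostic sketch, reusing the builder from the question; spark.jars.packages is the standard Spark property name, and the exact value printed will depend on your delta-spark version:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("transaction") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Only the delta-core coordinate survives, because configure_spark_with_delta_pip
# set spark.jars.packages last and replaced the excel coordinate.
print(spark.sparkContext.getConf().get("spark.jars.packages", ""))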

Approach 1

To resolve this, you can build the spark.jars.packages value yourself, including the appropriate Delta package the same way configure_spark_with_delta_pip does, and then create the session directly from the builder (without calling configure_spark_with_delta_pip again, which would overwrite the property), e.g.


import importlib_metadata
from pyspark.sql import SparkSession

delta_version = importlib_metadata.version("delta_spark")
scala_version = "2.12"
delta_package = f"io.delta:delta-core_{scala_version}:{delta_version}"

builder = SparkSession.builder.appName("transaction") \
    .config("spark.jars.packages", f"com.crealytics:spark-excel_2.12:0.14.0,{delta_package}") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = builder.getOrCreate()
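
With both coordinates in spark.jars.packages, the Excel reader and the Delta writer should resolve in the same session. A quick sanity check, using hypothetical paths that are not from the original post:

# Hypothetical input/output locations, only for illustration.
excel_path = "/data/transactions.xlsx"
delta_path = "/tmp/transactions_delta"

# Excel source provided by com.crealytics:spark-excel.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load(excel_path)

# Delta sink provided by io.delta:delta-core.
df.write.format("delta").mode("overwrite").save(delta_path)

# Read the table back through the Delta data source.
spark.read.format("delta").load(delta_path).show()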


Approach 2

Alternatively, you can first create the Spark session with the Delta package applied via configure_spark_with_delta_pip, and then call getOrCreate again to apply the remaining config properties to that existing session, e.g.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("transaction") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

builder = SparkSession.builder.appName("transaction") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0")

spark = builder.getOrCreate()

Since both builders use the same appName, the second getOrCreate retrieves the existing Spark session and also applies the new configuration to it. This behaviour is documented here as:

In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.

>>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
>>> s2 = SparkSession.builder.config("k2", "v2").getOrCreate()
>>> s1.conf.get("k1") == s2.conf.get("k1") 
True
>>> s1.conf.get("k2") == s2.conf.get("k2") 
True
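
Applied to this answer, the second getOrCreate should behave like "k2" above: the spark.jars.packages option from the second builder is applied to the conf of the already-running session. A small hedged check, assuming the Approach 2 code above has just run:

# Option applied by the second builder to the existing session's conf.
print(spark.conf.get("spark.jars.packages"))

# Options applied by the first (Delta-configured) builder are still in effect.
print(spark.conf.get("spark.sql.extensions"))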

Let me know if this works for you.

ggordon