
Brief

What are the possible ways to process data successfully with PySpark 3.0.0 from a pure pip installation, or at least to load data, without downgrading the Spark version?

When I attempt to load Parquet or CSV datasets, I get the exception shown under Exception Message below. Initializing the Spark session works fine, but loading the data fails.
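
For reference, the failure is triggered by a load as simple as the following (a minimal sketch; the file paths are just placeholders):

from pyspark.sql import SparkSession

# building the session itself succeeds
spark = SparkSession.builder.appName("repro").master("local[*]").getOrCreate()

# either of these calls raises the Py4JJavaError shown below
df_csv = spark.read.csv("/tmp/sample.csv", header=True)   # placeholder path
df_parquet = spark.read.parquet("/tmp/sample.parquet")    # placeholder path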

Some Information

  • Java: openjdk 11
  • Python: 3.8.5
  • Mode: local mode
  • Operating System: Ubuntu 16.04.6 LTS
  • Notes:
    1. I installed Spark by running `python3.8 -m pip install pyspark`.
    2. When I inspected spark-sql_2.12-3.0.0.jar (which lives under the Python site-packages path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), there is no v2 package under spark.sql.sources; the closest thing I found is an interface called DataSourceRegister in the same package (a quick check for both points is sketched right after this list).
    3. The most similar question I found on Stack Overflow is PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport), where downgrading the Spark version is the only recommendation given on that page.
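
As a quick sanity check for notes 1 and 2, the sketch below (standard library only, plus the pip-installed pyspark) prints which installation is actually imported and confirms that the bundled spark-sql jar contains no org/apache/spark/sql/sources/v2 entries:

import glob, os, zipfile
import pyspark

print(pyspark.__version__)   # expected: 3.0.0
print(pyspark.__file__)      # shows which site-packages copy gets imported

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
sql_jar = glob.glob(os.path.join(jars_dir, "spark-sql_2.12-*.jar"))[0]
with zipfile.ZipFile(sql_jar) as jar:
    names = jar.namelist()
print([n for n in names if "/sources/v2/" in n])                      # empty on 3.0.0
print([n for n in names if n.endswith("DataSourceRegister.class")])   # present on 3.0.0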

Exception Message

Py4JJavaError: An error occurred while calling o94.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
    at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
    at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1209)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1220)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1264)
    at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1299)
    at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1384)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 45 more
Scott Hsieh
  • pyspark is bundled with Spark itself, so you may not need to install it separately. Also, check with `pip list` that you don't have another version installed. – Alex Ott Jul 29 '20 at 10:13
  • @AlexOtt, thanks for the reminder. I know Python is just a wrapper, since the core of Spark is written in Scala and pyspark actually works by manipulating Java objects via py4j. The point I made in the post about looking up the jar refers to checking it under `${SPARK_HOME}/jars`, where ${SPARK_HOME} is under the Python site-packages path, i.e., `~/.local/lib/python3.8/site-packages/pyspark` in my case. – Scott Hsieh Jul 30 '20 at 02:44
  • Did you use the Spark 3.0.0 binary without Hadoop? I was able to brew-install a local Spark 3.0.0 on a MacBook that can read files, but when I call `spark_session.read.parquet()` using the without-Hadoop binary on an Ubuntu cluster, I run into the same issue, although the Java class that's not found for me is `java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport`. So I'm wondering if this is a reproducible issue for the `spark-3.0.0-bin-without-hadoop.tgz` binary. – bricard Jul 30 '20 at 17:52
  • @bricard I have only tried the pip installation and building from source. Maybe I could try downloading the binary you mentioned and check whether I hit the same or a similar issue in standalone mode. – Scott Hsieh Jul 31 '20 at 05:14

3 Answers


I had the same problem with Spark 3 and finally figured out the cause: I was including a custom jar that relied on the old Data Source V2 API.

The solution was to remove that custom jar, after which Spark started working properly.
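
In case it helps anyone hit by the same thing: if you are not sure which jar pulls in the removed API, a rough way to locate it (a heuristic sketch; the jar directory is only an example) is to scan each extra jar's class files for the old package name:

import glob, zipfile

OLD_API = b"org/apache/spark/sql/sources/v2"              # package removed in Spark 3.0
for jar_path in glob.glob("/path/to/extra/jars/*.jar"):   # example directory
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith(".class") and OLD_API in jar.read(name):
                print(f"{jar_path} references the removed API ({name})")
                break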

bricard
  • I set up a standalone Spark cluster a few days ago, with one master and four workers, using `spark-3.0.0-bin-hadoop3.2.tgz`. With it, I can retrieve data from AWS S3 in the formats Spark supports without this exception or similar ones. – Scott Hsieh Sep 02 '20 at 06:31

For now, I have found a workaround for manipulating data through Spark's Python APIs.

Workaround

1. Build a Spark distribution from source and configure it:

# clone a specific branch 
git clone -b branch-3.0 --single-branch https://github.com/apache/spark.git
## alternatively, you could try the following command
## git clone --branch v3.0.0 https://github.com/apache/spark.git

# build a Spark distribution
cd spark
./dev/make-distribution.sh --name spark3.0.1 --pip --r --tgz -e -PR -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=3.0.0 -DskipTests -Pkubernetes
## after changing the value of SPARK_HOME in `.bashrc_profile`
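## e.g. the line added to `.bashrc_profile` is something like:
## export SPARK_HOME=/opt/spark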
source ~/.bashrc_profile

# download the additional jars that are needed into the jars directory
cd ${SPARK_HOME}/assembly/target/scala-2.12/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.828/aws-java-sdk-bundle-1.11.828.jar
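## (the two jars above provide Hadoop's S3A filesystem and its bundled AWS SDK
## dependency, which are needed to read from AWS S3 via s3a:// URIs)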
cd ${SPARK_HOME}

# add related configurations for Spark
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
## add required or desired parameters to `spark-defaults.conf`
## in my case, I edited the configuration file with `vi`
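## for example (illustrative values only, mirroring the SparkConf settings in the
## Python snippet further below):
## spark.driver.memory                            10g
## spark.hadoop.fs.s3a.threads.max                512
## spark.hadoop.fs.s3a.connection.maximum         600
## spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.EnvironmentVariableCredentialsProvider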

# launch an interactive shell
pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 24 2020 05:43:01)
SparkSession available as 'spark'.
## after launching, I can read parquet and csv files without the exception

2. After setting up everything above, add ${SPARK_HOME}/python to the PYTHONPATH environment variable, then remember to source the relevant file (I added it to .bashrc_profile):
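
As an aside, the same effect can be approximated at runtime instead of editing .bashrc_profile (a sketch, assuming SPARK_HOME is already exported and the bundled py4j sources zip is still under ${SPARK_HOME}/python/lib):

import glob, os, sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# pyspark also needs the bundled py4j sources zip on the path
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

import pyspark  # now resolves to the freshly built distribution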

from pyspark import SparkConf
from pyspark.sql import SparkSession
sc = SparkConf()
threads_max = 512
connection_max = 600
sc.set("spark.driver.memory", "10g")
sc.set('spark.hadoop.fs.s3a.threads.max', threads_max)
sc.set('spark.hadoop.fs.s3a.connection.maximum', connection_max)
sc.set('spark.hadoop.fs.s3a.aws.credentials.provider',
           'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')
sc.set('spark.driver.maxResultSize', 0)
spark = SparkSession.builder.appName("cest-la-vie")\
    .master("local[*]").config(conf=sc).getOrCreate()
## after launching, I can read parquet and csv files without the exception
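
With that session, a read looks like this (the bucket and object keys are placeholders):

# both succeed once the session above has been created
df_parquet = spark.read.parquet("s3a://some-bucket/path/to/data.parquet")   # placeholder URI
df_csv = spark.read.csv("s3a://some-bucket/path/to/data.csv", header=True)  # placeholder URI
df_parquet.printSchema()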

Notes

I also attempted to make PySpark pip-installable from the source build, but I got stuck on testpypi's upload file-size limit. The point of this attempt was to have the pyspark package present under the site-packages directory. These were my steps:

cd ${SPARK_HOME}/python
# Step 1
python3.8 -m pip install --user --upgrade setuptools wheel
# Step 2
python3.8 setup.py sdist bdist_wheel ## /opt/spark/python
# Step 3
python3.8 -m pip install --user --upgrade twine
# Step 4
python3.8 -m twine upload --repository testpypi dist/*
## I had registered an account on testpypi and obtained a token
Uploading pyspark-3.0.1.dev0-py2.py3-none-any.whl

## stuck here
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 345M/345M [00:49<00:00, 7.33MB/s]
Received "503: first byte timeout" Package upload appears to have failed.  Retry 1 of 5
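
## note: as an alternative to the testpypi upload (untested on my side), the wheel
## built in Step 2 should also install directly from the local dist/ directory:
## python3.8 -m pip install --user dist/pyspark-3.0.1.dev0-py2.py3-none-any.whl
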
Scott Hsieh

I was using a standalone installation of Spark 3.1.1.

I have tried a lot of things.

I have excluded a lot of jar files.

After a lot of suffering, I decided to delete my Spark installation and install (unpack) a new one.

I don't know why... but it's working.

Pastor