
Hi, I'm trying to learn how to use PySpark, but when I run these first lines:

import pyspark
sc = pyspark.SparkContext('local[*]')

I get this error :

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x724b93a8) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x724b93a8

I can't seem to find what's causing it :/

3 Answers


This error typically means Spark is running on a newer Java version than it supports (Java 16+ enforces module access rules that Spark 3.2 does not handle). From the Spark 3.2.0 documentation:

Spark runs on Java 8/11, Scala 2.12, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0.

Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0.

For the Scala API, Spark 3.2.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).

For Python 3.9, Arrow optimization and pandas UDFs might not work due to the supported Python versions in Apache Arrow. Please refer to the latest Python Compatibility page.

For Java 11, -Dio.netty.tryReflectionSetAccessible=true is required additionally for the Apache Arrow library. This prevents java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available when Apache Arrow uses Netty internally.
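
If you are on Java 11 and need that flag, one way to pass it from PySpark (a sketch, not part of the quoted docs; spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are standard Spark configs) is:

from pyspark.sql import SparkSession

# Hand the Netty reflection flag to both the driver and executor JVMs.
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .getOrCreate()
)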

JCompetence
  • Hi, I am getting the same error. I have Scala 2.13.8, Java 17.0.2 and Python 3.9. I am on a Mac, installing apache-spark for the first time. Any idea what I am doing wrong? – Yashashvi Mar 24 '22 at 23:44

I’ve provided a link to a Spark installation guide: How to Install and Run PySpark in Jupyter Notebook on Windows

I’ve also provided a link to an installation video: YouTube video on how to run PySpark in Jupyter Notebook on Windows

This works for me.

Source: Eden Canlilar

How to Install and Run PySpark in Jupyter Notebook on Windows

When I write PySpark code, I use Jupyter notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.

A. Items needed

  1. Spark distribution from spark.apache.org (Download Apache Spark)
  2. Python and Jupyter Notebook. You can get both by installing the Python 3.x version of the Anaconda distribution.
  3. winutils.exe — a Hadoop binary for Windows — from Steve Loughran’s GitHub repo. Go to the corresponding Hadoop version in the Spark distribution and find winutils.exe under /bin. For example, https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
  4. The findspark Python module, which can be installed by running python -m pip install findspark in either Windows command prompt or Git Bash, once Python from item 2 is installed. You can find command prompt by searching cmd in the search box. (A quick check of this step follows the list.)
  5. If you don’t have Java or your Java version is 7.x or less, download and install Java from Oracle. I recommend getting the latest JDK (current version 9.0.1).
  6. If you don’t know how to unpack a .tgz file on Windows, you can download and install 7-Zip, then unpack the .tgz file from the Spark distribution in item 1 by right-clicking on the file icon and selecting 7-zip > Extract Here.
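
A quick way to confirm items 4 and 5 are in place before moving on (a sketch, not part of the original guide; run it from cmd or a notebook cell):

import subprocess

import findspark  # ImportError here means item 4 (pip install findspark) was skipped
print("findspark OK")

# java writes its version banner to stderr; a FileNotFoundError here means
# the JDK from item 5 is not on PATH yet.
print(subprocess.run(["java", "-version"], capture_output=True, text=True).stderr)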

B. Installing PySpark

After getting all the items in section A, let’s set up PySpark.

  1. Unpack the .tgz file. For example, I unpacked it with 7-Zip from step A6 and put mine under D:\spark\spark-2.2.1-bin-hadoop2.7

  2. Move the winutils.exe downloaded from step A3 to the \bin folder of Spark distribution. For example, D:\spark\spark-2.2.1-bin-hadoop2.7\bin\winutils.exe

  3. Add environment variables: the environment variables let Windows find where the files are when we start the PySpark kernel. You can find the environment variable settings by putting “environ…” in the search box.

    The variables to add are, in my example,

    Name                         Value
    SPARK_HOME                   D:\spark\spark-2.2.1-bin-hadoop2.7
    HADOOP_HOME                  D:\spark\spark-2.2.1-bin-hadoop2.7
    PYSPARK_DRIVER_PYTHON        jupyter
    PYSPARK_DRIVER_PYTHON_OPTS   notebook

    (A programmatic alternative is sketched at the end of this section.)

  4. In the same environment variable settings window, look for the Path or PATH variable, click Edit, and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7 you need to separate the values in Path with a semicolon ;.

  5. (Optional, if you see a Java-related error in step C) Find the installed Java JDK folder from step A5, for example, D:\Program Files\Java\jdk1.8.0_121, and add the following environment variable:

    Name        Value
    JAVA_HOME   D:\Progra~1\Java\jdk1.8.0_121

If the JDK is installed under \Program Files (x86), then replace the Progra~1 part with Progra~2 instead. In my experience, this error only occurs in Windows 7, and I think it’s because Spark couldn’t parse the space in the folder name. Edit (1/23/19): You might also find Gerard’s comment helpful: How to Install and Run PySpark in Jupyter Notebook on Windows
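
As a programmatic alternative to the environment variables in step B3 (a sketch, not part of the original guide; the paths are the examples above, so adjust them to your own install):

import os

# Same values as the table in step B3; set them before initializing findspark.
os.environ["SPARK_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"

import findspark
findspark.init()  # or point it directly: findspark.init(os.environ["SPARK_HOME"])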

C. Running PySpark in Jupyter Notebook

To run Jupyter Notebook, open Windows command prompt or Git Bash and run jupyter notebook. If you use Anaconda Navigator to open Jupyter Notebook instead, you might see a "Java gateway process exited before sending the driver its port number" error from PySpark. Fall back to Windows cmd if that happens.

Once inside Jupyter Notebook, create a new Python 3 notebook and run the following code:

import findspark
findspark.init()  # locate Spark via the SPARK_HOME environment variable

import pyspark  # only import after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # start (or reuse) a local Spark session

df = spark.sql('''select 'spark' as hello ''')
df.show()  # print the resulting one-row DataFrame


When you press run, it might trigger a Windows firewall pop-up. I pressed cancel on the pop-up as blocking the connection doesn’t affect PySpark.

If you see the following output, then you have installed PySpark on your Windows system!
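
For reference, the df.show() call prints a one-row table like this:

+-----+
|hello|
+-----+
|spark|
+-----+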

thrinadhn

What worked for me (switching to Java 8, which Spark supports):

brew install openjdk@8

sudo ln -sfn /usr/local/opt/openjdk@8/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-8.jdk

If you need to have openjdk@8 first in your PATH, run: echo 'export PATH="/usr/local/opt/openjdk@8/bin:$PATH"' >> ~/.zshrc

source ~/.zshrc
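
To verify the switch took effect (a sketch, not part of the original answer; macOS only):

import subprocess

# /usr/libexec/java_home reports the JDK that the symlink above registered.
print(subprocess.run(["/usr/libexec/java_home", "-v", "1.8"],
                     capture_output=True, text=True).stdout)

# java prints its version banner to stderr; it should now report 1.8.x.
print(subprocess.run(["java", "-version"], capture_output=True, text=True).stderr)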

Green Lion