
I am using Windows 11 Pro on a 64-bit PC. I have followed instructions to download and set up a Hadoop environment (version 3.3.1) and stored the Winutils.exe file (hadoop-3.0.0 version) in the 'bin' folder, downloaded an equivalent version of Spark (version 3.3.1), and created the new environment variables with the correct path to each on my C: drive. My JAVA_HOME path is also set to C:\Program Files\Java\jdk-18\bin\java.exe.
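
For reference, the environment variables were set roughly as follows (the install locations are illustrative and may differ on another machine):

    HADOOP_HOME = C:\hadoop        (with winutils.exe placed in C:\hadoop\bin)
    SPARK_HOME  = C:\spark
    JAVA_HOME   = C:\Program Files\Java\jdk-18\bin\java.exe
    Path        includes %HADOOP_HOME%\bin and %SPARK_HOME%\bin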

I am able to access the Scala interpreter using the 'spark-shell' command, but I'm unable to access the PySpark interpreter using the 'pyspark' or '.\pyspark' commands. I have also installed the pyspark (3.2.1) and py4j (0.10.9.3) libraries in Anaconda, but I get the following error when I run the 'pyspark' command in PowerShell or Jupyter Notebook.

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x1f97cf0d) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x1f97cf0d
    at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
    at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
    at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
    at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
    at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:67)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:483)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)
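
The error is thrown as soon as a Spark context is created, so a minimal notebook cell along these lines (the app name and builder options are illustrative) is enough to reproduce it:

    from pyspark.sql import SparkSession

    # Creating the session calls the JavaSparkContext constructor shown
    # in the traceback above, which is where the IllegalAccessError occurs.
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("test") \
        .getOrCreate()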
  • Try to set up Docker and use Docker images to host JupyterLab; you can set it up on Windows. Please refer to this AWS document (Jupyter Lab section): https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html – Yuva Jan 09 '23 at 02:53
  • https://stackoverflow.com/questions/74663072/glue-jupyter-notebook-locally-instead-of-labs/74673263?noredirect=1#comment131800779_74673263 – Yuva Jan 09 '23 at 02:54

1 Answer


Thank you for the alternative solution. After some research I was able to establish a working version of the PySpark Interpreter using the following method:

  1. Downloading the latest version of Java SDK and setting the %JAVA_HOME% variable to the correct path in 'Environment Variables'.
  2. Downloading hadoop-3.3.1-src.tar.gz and extracting it, then setting the %HADOOP_HOME% variable to the correct path in 'Environment Variables'.
  3. Downloading spark-3.3.1-bin-hadoop3.tgz and extracting it, then setting the %SPARK_HOME% variable to the correct path in 'Environment Variables'.
  4. Downloading Winutils.exe from 'https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin' and placing the executable file in the '\spark\bin' directory.
  5. Checking that the Windows Subsystem for Linux version is consistent across Docker Desktop and the installed Linux distros using the following command: > wsl -l -v. If not, updating WSL to version 2 in the command prompt or PowerShell using: > wsl --set-version (distro name) 2.
  6. Setting up the local metastore using this command: > mkdir C:\tmp\hive.
  7. Then: > C:\hadoop\bin\winutils.exe chmod 777 /tmp/hive.
  8. Navigate to the directory containing the Spark executables: > cd C:\spark\bin (for example).
  9. Launch the interpreter in local mode: C:\spark\bin> pyspark --master local[*].
  10. Alternatively, at the C:\spark\bin> prompt, simply type the 'pyspark' command: C:\spark\bin> pyspark (a quick sanity check to run once the shell starts is shown after this list).
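
Once the PySpark shell starts, a short sanity check along these lines (the column names and values are just examples) confirms that the interpreter and the JVM are talking to each other:

    # Run inside the interactive pyspark shell, where 'spark' is pre-defined.
    print(spark.version)                                     # 3.3.1 for this install
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()                                                # prints a two-row table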

This seems to have worked!