
I'm trying to run a PySpark shell, but when I do:

(test3.8python) [test@JupyterHub ~]$ python3 /home/test/spark3.1.1/bin/pyspark

I get the following error:

File "/home/test/spark3.1.1/bin/pyspark", line 20  
if [ -z "${SPARK_HOME}" ]; then
        ^
SyntaxError: invalid syntax

I've set the following in ~/.bashrc:

export SPARK_HOME=/home/test/spark3.1.1
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
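
With the PYTHONPATH above, I'd expect a plain python3 started from this shell to be able to import the module. This is the minimal check I have in mind (just a sketch, reusing the paths from my ~/.bashrc):

# run inside a python3 session started from a shell that has sourced ~/.bashrc
import pyspark
# with PYTHONPATH pointing at $SPARK_HOME/python, this should resolve to
# /home/test/spark3.1.1/python/pyspark/__init__.py
print(pyspark.__file__)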

If I try to run it from a Jupyter notebook as follows:

import pyspark
from pyspark.sql import SparkSession

# starting daemons for standalone mode
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077
        
#spark standalone
spark = SparkSession.builder \
        .appName("test") \
        .master("spark://JupyterHub:7077")\
        .config("spark.cores.max","5")\
        .config("spark.executor.memory","2g")\
        .config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12-SNAPSHOT')\
        .config("spark.executor.cores","5")\
        .enableHiveSupport() \
        .getOrCreate()

I get the following error:

ModuleNotFoundError: No module named 'pyspark'
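
To see what the notebook kernel actually inherits, this is the kind of check I can run in a cell (just a sketch; the expected values are the ones from my ~/.bashrc above):

import os, sys
# if the ~/.bashrc exports never reach the JupyterHub-spawned kernel,
# these can come back as None / without the Spark paths
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYTHONPATH"))
print([p for p in sys.path if "spark" in p.lower()])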

I don't understand why this fails, since I've pointed PYTHONPATH in ~/.bashrc at the Python files in my Spark folder and made sure the changes took effect. Furthermore, while fiddling around I tried to use the findspark library; now if I run the same code with the added import:

import findspark
spark_location='/home/test/spark3.1.1/' 
findspark.init(spark_home=spark_location)
import pyspark
from pyspark.sql import SparkSession

# starting daemons for standalone mode
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077
        
#spark standalone
spark = SparkSession.builder \
        .appName("test") \
        .master("spark://JupyterHub:7077")\
        .config("spark.cores.max","5")\
        .config("spark.executor.memory","2g")\
        .config("spark.jars.packages",'org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0-SNAPSHOT')\
        .config("spark.executor.cores","5")\
        .enableHiveSupport() \
        .getOrCreate()

It looks like it is now able to find pyspark, which makes no sense to me since I've already specified everything in my bash file and set SPARK_HOME. However, I get another error:

starting org.apache.spark.deploy.master.Master, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.master.Master-1-JupyterHub.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.worker.Worker-1-JupyterHub.out

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-7d402e7d71bf> in <module>
     10 
     11 #spark standalone
---> 12 spark = SparkSession.builder \
     13         .appName("test") \
     14         .master("spark://JupyterHub:7077")\

~/spark3.1.1/python/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

~/spark3.1.1/python/pyspark/context.py in getOrCreate(cls, conf)
    382         with SparkContext._lock:
    383             if SparkContext._active_spark_context is None:
--> 384                 SparkContext(conf=conf or SparkConf())
    385             return SparkContext._active_spark_context
    386 

~/spark3.1.1/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

~/spark3.1.1/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    329         with SparkContext._lock:
    330             if not SparkContext._gateway:
--> 331                 SparkContext._gateway = gateway or launch_gateway(conf)
    332                 SparkContext._jvm = SparkContext._gateway.jvm
    333 

~/spark3.1.1/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

I've already checked JupyterHub:7077 through the master web UI on the default 8080 port and everything is alive and well, so I did successfully start both the master and the worker.

Even when running Spark in local mode with master("local[*]") I get the exact same error as above.
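
For completeness, this is the kind of workaround I could try from the notebook itself, setting the environment before the gateway is launched (just a sketch; it reuses the paths from my ~/.bashrc and I'm not sure it addresses the real cause):

import os
# make sure the kernel process itself has the paths, in case it does not inherit ~/.bashrc
os.environ["SPARK_HOME"] = "/home/test/spark3.1.1"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk"
os.environ["PYSPARK_PYTHON"] = "python3"

import findspark
findspark.init()  # with no argument, findspark reads SPARK_HOME from the environment

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()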

I'm completely lost. Any idea why I can't run PySpark either from the shell or from a Jupyter notebook?

thanks

nonoDa
  • You don't have to run `python /home/test/spark3.1.1/bin/pyspark`; instead, run just `/home/test/spark3.1.1/bin/pyspark`. The `pyspark` file is a bash script that sets some environment variables and then executes spark-submit. – danielsepulvedab Mar 10 '21 at 21:35
  • Yes, that was it. I did not notice it wasn't a .py file; I had to do ./pyspark and it ran successfully. However, I still have problems running Spark from the notebook, any ideas on that? – nonoDa Mar 11 '21 at 08:06

0 Answers