
I'm trying to connect to a remote Spark master from a notebook on my local machine.

When I try to create a SparkContext:

import pyspark

sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter notebook_test")

I get the following exception:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 136                           conf, jsc, profiler_cls)
    137         except:
    138             # If an error occurs, clean up in order to allow future SparkContext creation:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    196 
    197         # Create the Java SparkContext through Py4J
--> 198         self._jsc = jsc or self._initialize_context(self._conf._jconf)
    199         # Reset the SparkConf to the one actually used by the SparkContext in JVM.
    200         self._conf = SparkConf(_jconf=self._jsc.sc().conf())

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _initialize_context(self, jconf)
    304         Initialize SparkContext in function to allow subclass specific initialization
    305         """
--> 306         return self._jvm.JavaSparkContext(jconf)
    307 
    308     @classmethod

/opt/.venv/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

/opt/.venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)

At the same time, I can create a Spark context using the same interpreter in interactive mode.

What should I do to connect to the remote Spark master from my local Jupyter notebook?

  • This usually happens if PySpark is not able to communicate with the master. Make sure the hostname is correct and that you have correctly set `SPARK_HOME` and `PYSPARK_PYTHON` in the environment. Mismatched local and remote Spark versions can result in that error too. – Hristo Iliev Apr 27 '20 at 12:03
  • I have the same Spark on my workstation and on the cluster (2.4.5). I have already set PYSPARK_PYTHON and SPARK_HOME. That lets me connect to the cluster using python, but I can't do it from the notebook @HristoIliev Maybe I should set a special setting for Jupyter? – Grigory Skvortsov Apr 27 '20 at 12:11
  • Print the value of `os.environ` in both stand-alone `python` and in your notebook and look for differences (a minimal sketch follows these comments). – Hristo Iliev Apr 27 '20 at 12:18
  • Thank you so much! My notebook's environ doesn't contain SPARK_HOME. – Grigory Skvortsov Apr 27 '20 at 12:33
  • You can use [findspark](https://pypi.org/project/findspark/) to simplify the process - it sets all the environment variables for you. – Hristo Iliev Apr 27 '20 at 12:35
  • @HristoIliev and Grigory: Maybe one of you can add this as an answer here. – Shaido Apr 28 '20 at 09:09
  • @HristoIliev set them on the remote machine or local machine? – Snow Nov 09 '20 at 13:40
  • @Snow on the local machine. – Hristo Iliev Nov 09 '20 at 15:49
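
Following the `os.environ` suggestion above, a minimal way to compare the two environments (the variable names checked here are the usual Spark ones; adjust to your setup):

import os

# Run this both in a stand-alone `python` shell and in a notebook cell,
# then compare the output to spot variables missing from the notebook.
for key in ("SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
    print(key, "=", os.environ.get(key))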

1 Answer


I solved my problem using @HristoIliev's advice. In my case, PYSPARK_PYTHON was not set inside the Jupyter environment. Simple solution:

import os

# Point PySpark at the same interpreter and Spark installation
# that work outside the notebook (paths from this setup).
os.environ["PYSPARK_PYTHON"] = '/opt/.venv/bin/python'
os.environ["SPARK_HOME"] = '/opt/spark'

You can also use findspark for this, but I didn't test it.
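
For reference, a minimal findspark sketch (untested here, as noted; `findspark.init` accepts the Spark installation path explicitly, or reads SPARK_HOME if it is already set):

import findspark

# Locate the Spark installation and make pyspark importable
# by fixing up sys.path and the relevant environment variables.
findspark.init("/opt/spark")

import pyspark

sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter notebook_test")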
