
I'm working on a project using spark-nlp, and I believe I've set something up incorrectly: my Python file only runs successfully when I use `sudo python main.py`. This also prevents me from using my IDE for Jupyter notebooks. I would like help fixing my system so I can just run `python main.py`.

If I don't use sudo, I get the following error:

Exception: Java gateway process exited before sending the driver its port number

I followed many of the solutions listed on the biggest Stack Overflow question on this topic, including:

  • Installing apache-spark with homebrew and without

  • Using pyspark from pip instead

  • Setting JAVA_HOME to jdk8 and jdk11

  • Setting other variables such as SPARK_HOME, PYTHONPATH, PYSPARK_DRIVER etc.
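For the record, the exports I experimented with looked roughly like this in my shell profile (the paths below are reconstructed examples, not necessarily my exact values):

```shell
# Rough sketch of the variables I tried; paths are illustrative.
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"    # or -v 1.8 for JDK 8
export SPARK_HOME="/opt/homebrew/opt/apache-spark/libexec"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
export PYSPARK_PYTHON=python3
```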

Finally, I saw a comment stating that the IDE didn't have the correct permissions, so the commenter was launching their IDE with sudo. That gave me the idea to try running with `sudo python`, and suddenly I no longer got the error. It appears that pyspark needs to create a temporary folder and doesn't have permission to do so without sudo?
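One way to sanity-check that theory is a quick stdlib script that reports whether the current user can write to the directories pyspark plausibly touches (the `~/.ivy2` and `~/.cache` locations are my guesses, not something I've confirmed):

```python
import os
import tempfile

def writable(path: str) -> bool:
    """Return True if the current user can write to `path`.

    If the path doesn't exist yet, check its parent directory instead,
    since pyspark would need to create it there.
    """
    expanded = os.path.expanduser(path)
    if os.path.exists(expanded):
        return os.access(expanded, os.W_OK)
    return os.access(os.path.dirname(expanded) or ".", os.W_OK)

# Locations Spark and its dependency resolver might use (my assumption):
for p in [tempfile.gettempdir(), "~/.ivy2", "~/.cache"]:
    print(p, "writable:", writable(p))
```

If any of these print `False` for my normal user, that would explain why only sudo works.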

Currently I have JDK 11 set as JAVA_HOME and no other environment variables set. I am using a conda environment with Python 3.11, spark-nlp 5.0.2, and pyspark 3.1.1 on an M1 Pro Mac running macOS 13.4.

Edit: As requested, I've included the full stack trace from running python without sudo.

Exception in thread "main" java.lang.RuntimeException: [download failed: javax.annotation#javax.annotation-api;1.3.2!javax.annotation-api.jar]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1456)
        at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/Users/username/code/nlp-python/example/main.py", line 3, in <module>
    spark = sparknlp.start(apple_silicon=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/sparknlp/__init__.py", line 300, in start
    spark_session = start_without_realtime_output()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/sparknlp/__init__.py", line 198, in start_without_realtime_output
    return builder.getOrCreate()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/sql/session.py", line 269, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 483, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 195, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 417, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
                                       ^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
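The first line of the trace suggests the failure happens while Spark's dependency resolver (Ivy) tries to download a jar. To see whether that cache or the temp directory is owned by root (perhaps left over from an earlier sudo run; `~/.ivy2` being the cache location is my assumption), one could check:

```shell
# Show ownership of the Ivy cache and the system temp directory;
# if either is owned by root, plain `python` may be unable to write there.
ls -ld ~/.ivy2 2>/dev/null || echo "~/.ivy2 does not exist"
ls -ld /tmp
```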
  • There must be more errors that indicate why the gateway process exited – Bernhard Stadler Aug 12 '23 at 10:13
  • I updated my post to include the full stacktrace, what I gathered from the other post is that to launch the gateway, pyspark needs to create a temporary directory and without the sudo permission it fails to do so which causes the exception. However I don't have a great understanding of permissions or pyspark so I'm not really sure what to do with this information. – Gekko Aug 12 '23 at 14:45
  • And there are no error messages before this one? – Bernhard Stadler Aug 14 '23 at 10:20
  • No other error messages before this one – Gekko Aug 15 '23 at 01:47
  • Then it's probably a permissions problem. Just try changing the file owner of all files and directory in your conda environment and your project directory to yourself. I'm not familiar with MacOS, but for Linux it would be `sudo chown -R your_user_name /Users/username/anaconda3/envs/sparknlp /your/project/directory`. The execution may take a while because conda environments tend to contain many files. – Bernhard Stadler Aug 15 '23 at 08:45
