I'm working on a project using spark-nlp, and I believe I've set something up incorrectly, because I can only run my Python file successfully with `sudo python main.py`. This is also preventing me from using my IDE for Jupyter notebooks. I'd like help fixing my system so I can just run `python main.py`.
If I don't use `sudo`, I get the following error:
Exception: Java gateway process exited before sending the driver its port number
I've tried many of the solutions listed in the biggest Stack Overflow question on this topic, including:

- Installing apache-spark with Homebrew and without
- Using pyspark from pip instead
- Setting JAVA_HOME to JDK 8 and JDK 11
- Setting other variables such as SPARK_HOME, PYTHONPATH, PYSPARK_DRIVER, etc. (a sketch of what that looked like is below)
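For reference, this is roughly how I was setting those variables before creating the session (a minimal sketch; the paths are illustrative examples, not my actual ones):

```python
import os

# Illustrative paths only -- the real locations depend on how Java/Spark were installed
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"  # also tried a JDK 8 path
os.environ["SPARK_HOME"] = "/opt/homebrew/opt/apache-spark/libexec"
os.environ["PYSPARK_PYTHON"] = "/Users/username/anaconda3/envs/sparknlp/bin/python"

# The variables have to be in place before pyspark launches the JVM gateway
import sparknlp
spark = sparknlp.start(apple_silicon=True)
```

None of these combinations made a difference; the session still only starts under `sudo`.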
Finally, I saw a comment saying that the IDE didn't have the correct permissions, so the commenter was launching their IDE with sudo. That gave me the idea to try `sudo python`, and suddenly I no longer got the error. It appears that pyspark needs to create a temporary folder and doesn't have permission to do so without sudo?
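If it really is a permission problem, one thing I plan to try is pointing Spark's scratch space and dependency cache at folders my user definitely owns. A sketch of that idea; I believe recent spark-nlp versions let `start()` forward extra Spark config through a `params` dict, but treat the exact argument as an assumption:

```python
import sparknlp

# Assumption: start() accepts extra Spark config via a `params` dict.
# Both directories below are hypothetical, user-owned folders.
spark = sparknlp.start(
    apple_silicon=True,
    params={
        "spark.local.dir": "/Users/username/spark-tmp",  # scratch dir instead of /tmp
        "spark.jars.ivy": "/Users/username/ivy-cache",   # per-user Ivy cache
    },
)
```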
Currently I have JDK 11 set as JAVA_HOME and no other environment variables set. I'm using a conda environment with Python 3.11, spark-nlp 5.0.2, and pyspark 3.1.1 on an M1 Pro Mac running macOS 13.4.
Edit: As requested, I've included the full stack trace from running `python` without sudo.
Exception in thread "main" java.lang.RuntimeException: [download failed: javax.annotation#javax.annotation-api;1.3.2!javax.annotation-api.jar]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1456)
    at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/Users/username/code/nlp-python/example/main.py", line 3, in <module>
    spark = sparknlp.start(apple_silicon=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/sparknlp/__init__.py", line 300, in start
    spark_session = start_without_realtime_output()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/sparknlp/__init__.py", line 198, in start_without_realtime_output
    return builder.getOrCreate()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/sql/session.py", line 269, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 483, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 195, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/context.py", line 417, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
                                       ^^^^^^^^^^^^^^^^^^^^
  File "/Users/username/anaconda3/envs/sparknlp/lib/python3.11/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
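Looking at the first line of that trace, the failure actually happens while Spark resolves Maven dependencies, which as far as I know go through an Ivy cache under `~/.ivy2` by default. Since the `sudo` runs succeeded, I suspect that cache (or something under /tmp) may now be owned by root. A quick way to check ownership and writability (a sketch):

```python
import os
from pathlib import Path

# Check who owns the directories pyspark needs to write to.
# ~/.ivy2 is, as far as I know, Spark's default Ivy cache; if an earlier
# `sudo` run created it, it may be owned by root rather than my user.
for p in (Path.home() / ".ivy2", Path("/tmp")):
    if p.exists():
        st = p.stat()
        print(p, "owner uid:", st.st_uid, "writable by me:", os.access(p, os.W_OK))
    else:
        print(p, "does not exist")
```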