
After installing Anaconda3 and Spark (2.3.2), I'm trying to run a sample PySpark program.

This is just a sample program that I'm running through Jupyter, and I'm getting an error like:

Python worker failed to connect back.

As per the question below on Stack Overflow:

Python worker failed to connect back

I can see a solution like this: "I got the same error. I solved it by installing the previous version of Spark (2.3 instead of 2.4). Now it works perfectly; maybe it is an issue with the latest version of PySpark."

But I'm using Spark version 2.3.1 with Python 3.7, and I'm still facing that issue. Please help me solve this error.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("mySparkApp").getOrCreate()

# Distribute a small list as an RDD and run a job on it
testData = spark.sparkContext.parallelize([3, 8, 2, 5])
testData.count()

The traceback is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 6, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
– Rajesh P

3 Answers


Just add the environment variable PYSPARK_PYTHON with the value python; it solves the issue. There is no need to upgrade or downgrade the Spark version. It worked for me.
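For example, a minimal sketch of setting it from the driver process before Spark starts (assuming the python on your PATH is the interpreter you want; on some setups you may need the full path to the executable instead):

import os

# Tell Spark which Python executable the workers should launch.
# This must be set before the SparkSession (and its JVM) is created.
os.environ["PYSPARK_PYTHON"] = "python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mySparkApp").getOrCreate()
print(spark.sparkContext.parallelize([3, 8, 2, 5]).count())  # prints 4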


– Ramineni Ravi Teja

Set your environment variables as follows:

  • PYSPARK_DRIVER_PYTHON=jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS=notebook
  • PYSPARK_PYTHON=python

The heart of the problem is the connection between PySpark and the Python interpreter, which setting these variables fixes; a sketch of one way to apply them is shown below.
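A minimal sketch for an already-running notebook kernel (note: the two PYSPARK_DRIVER_PYTHON* variables above only take effect when Spark is launched through the pyspark script, so this variant simply pins both driver and workers to the current interpreter):

import os
import sys

# Pin workers (and, via the pyspark launcher, the driver) to the same
# interpreter the notebook kernel is running in, so the versions match.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable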

– Henrique Branco
    This worked for me as well. However, I did not have any issues when reading a Parquet file, only while creating a DataFrame manually. – targetXING Sep 03 '21 at 09:56

Please make sure you have set the environment variables properly; one quick way to check them is shown below.
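A small sketch to verify which values the driver process actually sees, using only the variable names from the answers above:

import os

# Print the Spark-related variables as the driver process sees them.
for name in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(name, "=", os.environ.get(name, "<not set>"))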


– Asif Raza