from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, month

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James", "Smith", "USA", "2019-06-24 12:01:19.000"),
        ("Michael", "Rose", "USA", "2019-06-24 12:01:19.000"),
        ("Robert", "Williams", "USA", "2019-06-24 12:01:19.000"),
        ("Maria", "Jones", "USA", "2019-06-24 12:01:19.000")]
columns = ["firstname", "lastname", "country", "datetime_column"]
df = spark.createDataFrame(data=data, schema=columns)

# Extract the month from the timestamp string; month() already yields an
# integer, so the cast is just being explicit.
df = df.withColumn('month', month(to_date(df['datetime_column'])).cast('int'))
df.show()

I have this PySpark code, but when I try to run it I get an error like this:

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Python bulunamadı  [Turkish: "Python not found"]
22/12/31 15:22:05 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
22/12/31 15:22:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.

ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users/User/PycharmProjects/tasks/main.py", line 15, in <module>
    df.show()
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\utils.py", line 190, in deco
    return f(*a, **kw)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o45.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage

My Java version is 11 and PySpark is 3.3.1. I can't figure out what's wrong and am stuck here; any help would be appreciated.
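One thing I have seen suggested for "Python worker failed to connect back" on Windows (I'm not sure it's the cause here) is that Spark's worker processes can't find a Python interpreter, which would also explain the "Python bulunamadı" line in the log. The sketch below just pins the worker interpreter to the one running the script via environment variables, before the SparkSession is created; the variable names are standard Spark configuration, but whether this fixes my setup is an assumption:

```python
import os
import sys

# Point Spark's Python workers (and the driver) at the same interpreter
# that is running this script, so the workers don't fail to locate Python.
# These must be set before SparkSession.builder...getOrCreate() is called.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

After setting these, the original `SparkSession.builder.appName(...).getOrCreate()` call would run unchanged.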
