I'm trying to create a DataFrame from two custom sentences, just as a test, but the code below fails when I try to display it.

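from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName('first').getOrCreate()
# Two test rows, each an (id, sentence) tuple
df = spark.createDataFrame(
    [
        (0, "Hi this is a Spark tutorial"),
        (1, "This tutorial is made in Python language")
    ], ['id', 'sentence']
)
df.show()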

This gives me this error:

Py4JJavaError: An error occurred while calling o73.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (executor driver): org.apache.spark.SparkException: Python worker failed to connect back.

I tried to create a schema

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("sentence", StringType(), True)]
)

and pass it as the schema=schema argument, but I hit the same dead end.

Omar
  • Your code runs well, it does not have a problem. But Spark is failing for some reason. What's your setup like? Is it a local installation? – ernest_k Feb 01 '22 at 08:15
  • Does this help? https://stackoverflow.com/questions/53252181/python-worker-failed-to-connect-back – blackbishop Feb 01 '22 at 08:16
  • @ernest_k yes, it's a local script. I'm using VSCode and Pyspark 3.2.1 – Omar Feb 01 '22 at 08:19
  • @blackbishop it seems that installing `findspark` and importing it in the script works. Thanks! But showing a DataFrame of 2 rows takes almost 20 seconds. Why is that happening? – Omar Feb 01 '22 at 08:23
  • @blackbishop `findspark` works for some DataFrames, but not in all cases. I get the same error when creating others – Omar Feb 01 '22 at 08:38
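
A commonly reported cause of "Python worker failed to connect back" on local installs is that the Spark workers launch a different Python interpreter than the one running the driver script. Whether that is what happens here is an assumption; nothing in the question confirms it. A minimal sketch of the usual workaround sets Spark's `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables to the current interpreter before the session is created. The helper names `pin_pyspark_python` and `build_sentence_df` are hypothetical, made up for this illustration.

```python
import os
import sys


def pin_pyspark_python():
    """Point Spark's driver and workers at the interpreter running this script.

    Assumption: the failure comes from an interpreter mismatch. This is a
    possible workaround, not a confirmed fix for the setup in the question.
    """
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return sys.executable


def build_sentence_df():
    """Create the two-row test DataFrame from the question (hypothetical helper)."""
    # Set the variables first: Spark reads them when the session starts.
    pin_pyspark_python()

    # Imported here so the environment is configured before pyspark is used.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("first").getOrCreate()
    return spark.createDataFrame(
        [
            (0, "Hi this is a Spark tutorial"),
            (1, "This tutorial is made in Python language"),
        ],
        ["id", "sentence"],
    )
```

Calling `build_sentence_df().show()` would then run the original test. The variables have to be set before the first `getOrCreate()` call in the process; if a session is already running, restart the interpreter (or the VSCode kernel) before trying this.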

0 Answers