
I parsed 500k tweets as a test using Spark NLP. The resulting DataFrame looks fine. I converted the annotation arrays to strings using:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Join the elements of an array column into a single bracketed, comma-separated string
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

result = (result
          .withColumn('token', array_to_string_udf(result["token"]))
          .withColumn('ner', array_to_string_udf(result["ner"]))
          .withColumn('embeddings', array_to_string_udf(result["embeddings"]))
          .withColumn('ner_chunk', array_to_string_udf(result["ner_chunk"]))
          .withColumn('document', array_to_string_udf(result["document"])))

The DataFrame still looks fine. However, whenever I try to convert it to pandas or export it to a CSV, I keep getting the following error:

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "C:\spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 584, in main
  File "C:\spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 562, in read_int
    length = stream.read(4)
  File "C:\ProgramData\Anaconda3\lib\socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

This makes me think Spark is not talking to Python. Does anyone know what the problem might be?

Viktor Avdulov

1 Answer


When calling toPandas, all the data is loaded into the driver’s memory; the effect is essentially the same as calling collect.
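
If only a sample of the data is needed on the driver, one way to keep the memory footprint bounded is to limit the DataFrame before converting (a minimal sketch; the 1000-row cap and file name are arbitrary choices):

# Pull only a bounded number of rows to the driver before the pandas conversion
sample_pdf = result.limit(1000).toPandas()
sample_pdf.to_csv('sample_result.csv', index=False)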

A better approach for writing the contents of a DataFrame to a CSV is to use PySpark's DataFrameWriter directly:

result.write.csv('my_result.csv')
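
Note that write.csv produces a directory containing one part file per partition. If a single file with a header row is preferred (only advisable when the result is small enough to pass through one task), something like the following should work; the output directory name is illustrative:

result.coalesce(1).write.option('header', True).mode('overwrite').csv('my_result_csv')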

Edit: probably not directly related to the question, but it would be possible to replace the udf with native Spark functions (lit, concat and concat_ws):

from pyspark.sql import functions as F

result.withColumn("token", F.concat(F.lit("["), F.concat_ws(",", "token"),F.lit("]")))....

Replacing the udf with native Spark functions should improve performance.
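
For illustration, here is the same pattern applied to every column from the question. This assumes each column is (or has been converted to) an array of strings; with Spark NLP annotation structs you would first need to extract the string field, so treat this as a sketch rather than a drop-in replacement:

from pyspark.sql import functions as F

# Columns from the question, assumed here to be arrays of strings
array_columns = ['token', 'ner', 'embeddings', 'ner_chunk', 'document']

for c in array_columns:
    result = result.withColumn(
        c, F.concat(F.lit('['), F.concat_ws(',', c), F.lit(']'))
    )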

werner
  • Thank you for your answer, but the cited error comes up regardless of whether I try to export to a CSV or call the toPandas function – Viktor Avdulov Aug 02 '21 at 03:24
  • @ViktorAvdulov could you please try to limit the amount of data using [limit](http://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.limit.html)? Does `result.limit(100).write.csv('my_result.csv')` return the same error? – werner Aug 02 '21 at 09:08