As a test, I parsed 500k tweets with Spark NLP. The resulting DataFrame looks fine. I then converted the array columns to strings using the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Serialize each array column into a single bracketed string.
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())

result = (result
    .withColumn('token', array_to_string_udf(result['token']))
    .withColumn('ner', array_to_string_udf(result['ner']))
    .withColumn('embeddings', array_to_string_udf(result['embeddings']))
    .withColumn('ner_chunk', array_to_string_udf(result['ner_chunk']))
    .withColumn('document', array_to_string_udf(result['document'])))
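Side note: since the Spark NLP annotation columns are arrays of structs, I believe the same flattening could be done without a Python UDF at all, using the built-in to_json (which stays in the JVM, so no Python worker round-trip is involved). A minimal sketch, assuming the same result DataFrame and column names as above:

from pyspark.sql.functions import to_json, col

# Built-in JSON serialization of the array<struct> annotation columns;
# runs entirely on the JVM side, no Python worker needed.
for c in ['token', 'ner', 'embeddings', 'ner_chunk', 'document']:
    result = result.withColumn(c, to_json(col(c)))

(Spark NLP also ships a Finisher transformer meant for turning annotations into plain strings, which may be the more idiomatic route.)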
The DataFrame still looks fine after this. However, whenever I try to convert it to pandas to export it to a CSV, I keep getting the following error:
PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "C:\spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 584, in main
  File "C:\spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 562, in read_int
    length = stream.read(4)
  File "C:\ProgramData\Anaconda3\lib\socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
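For reference, the step that triggers this is essentially the usual collect-to-driver path (the output file name here is just a placeholder):

# Collects all 500k rows to the driver, then writes from pandas.
pdf = result.toPandas()
pdf.to_csv('tweets.csv', index=False)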
The timeout makes me think Spark is not talking to its Python worker. Does anyone know what the problem might be?
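In case it matters, the end goal is just a CSV on disk, so I assume Spark could also write it directly and skip pandas entirely; something like the sketch below (output directory is a placeholder), though I would still like to understand the timeout:

# Let Spark write the CSV itself (produces a directory of part files);
# this avoids collecting everything to the driver via toPandas().
result.write.option('header', True).mode('overwrite').csv('tweets_csv')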