
I have a pyspark.sql.dataframe.DataFrame with 1300 rows and 5 columns. I use the following to export the dataframe to C:/temp:

c5.toPandas().to_csv("C:/temp/colspark.csv")

But I get the following error:

<ipython-input-4-2c57938dba1e> in <module>
----> 1 c5.toPandas().to_csv("C:/temp/colspark.csv")

S:\tdv\ab\ecp\Spark\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\dataframe.py in toPandas(self)
   2141 
   2142         # Below is toPandas without Arrow optimization.

(...)

Py4JJavaError: An error occurred while calling o689.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 50.0 failed 1 times, most recent failure: Lost task 0.0 in stage 50.0 (TID 2190, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last)

What I have tried so far:

``spark.conf.set("spark.sql.execution.arrow.enabled", "true")``

But:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-5-92bc22b46531> in <module>
      1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 c5.toPandas().to_csv("C:/temp/colspark.csv")

S:\tdv\ab\ecp\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\dataframe.py in toPandas(self)
   2120                         _check_dataframe_localize_timestamps
   2121                     import pyarrow
-> 2122                     batches = self._collectAsArrow()
   2123                     if len(batches) > 0:
   2124                         table = pyarrow.Table.from_batches(batches)

S:\tdv\ab\ecp\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\dataframe.py in _collectAsArrow(self)
   2182                 return list(_load_from_socket((port, auth_secret), ArrowStreamSerializer()))
   2183             finally:
-> 2184                 jsocket_auth_server.getResult()  # Join serving thread and raise any exceptions


I have also tried some of the solutions from
https://stackoverflow.com/questions/31937958/how-to-export-data-from-spark-sql-to-csv
but I cannot figure out how to proceed. Is there any way to avoid the Arrow optimization? Or do I have to use another method to save the CSV file?
ecp

1 Answer


I understand that you are trying to save a Spark DataFrame to a CSV file in your local directory. If so, write it as below:

dfname.write.csv("c:\\temp\\csvfoldername")
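If you also want the column names in the file and a single output file instead of one part file per partition, here is a minimal sketch (``dfname`` and the path are placeholders, as above):

# Collect the data into one partition so Spark writes a single part file,
# then include a header row and overwrite the folder if it already exists.
dfname.coalesce(1).write.option("header", "true").mode("overwrite").csv("c:\\temp\\csvfoldername")

Note that Spark writes a folder at that path, not a single file; the actual data ends up in a ``part-*.csv`` file inside it. If you still want ``toPandas()``, Arrow can be switched off explicitly with ``spark.conf.set("spark.sql.execution.arrow.enabled", "false")``, although your first traceback already came from the non-Arrow path, which suggests the failure is elsewhere in the job rather than in Arrow itself.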
Karthik
  • I think the main problem is that ``df2 = df1.toPandas()`` itself fails; the traceback points at the non-Arrow path ("# Below is toPandas without Arrow optimization"). – ecp Oct 04 '19 at 07:09
  • When I type c5.columns I get no index and dtype; could that be the problem? It shows ['a', 'b', 'c', 'd'], but should it be Index(['a', 'b', 'c', 'd'], dtype='object')? How can I change it? – ecp Oct 04 '19 at 07:21
  • It was definitely another type of problem; your answer is OK. Thanks. – ecp Oct 04 '19 at 07:38