
I have a large dataset (around 6 GB) that I have processed and cleaned using PySpark, and I now want to save it so I can use it elsewhere for machine learning.

I am trying to find the fastest way to save the dataset. I followed this link, but saving to either CSV or Parquet is taking a very long time: How to export a table dataframe in PySpark to csv?

Can someone please provide some information on how I can do this?

  • csv is probably not a good format for such a large dataset, parquet would be much better – Chris Mar 17 '22 at 16:37
  • I tried following the tutorial from the link in my question but it still takes a long time. Using this code: `scaled.withColumn("par_id", col('col_1') % 50).repartition(50, 'col_1').write.format('parquet').save("/saveFolder")` – user3234242 Mar 17 '22 at 16:39

0 Answers