
I have a large dataset (around 6 GB) that I have processed and cleaned using PySpark, and I now want to save it so I can use it elsewhere for machine learning.

I am trying to find the fastest way to save the dataset. I followed this link, but saving to either CSV or Parquet is taking a very long time: How to export a table dataframe in PySpark to csv?

Can someone please provide some information on how I can do this?

  • csv is probably not a good format for such a large dataset, parquet would be much better – Chris Mar 17 '22 at 16:37
  • I tried following the tutorial from the link in my question but it still takes a long time. Using this code: `scaled.withColumn("par_id", col('col_1') % 50).repartition(50, 'col_1').write.format('parquet').save("/saveFolder")` – user3234242 Mar 17 '22 at 16:39

0 Answers