
I have a Hive table with a date field in it:

+----------+------+-----+ 
|data_field|  col1| col2| 
+----------+------+-----+ 
|10/01/2018|   125|  abc| 
|10/02/2018|   124|  def| 
|10/03/2018|   127|  ghi| 
|10/04/2018|   127|  klm| 
|10/05/2018|   129|  nop| 
+----------+------+-----+

I am reading the table as follows:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)  # sc is an existing SparkContext
df = hive_context.sql("select data_field, col1, col2 from table")

I would like to split the input data into separate files based on the data_field column, writing each day's rows into a folder named for that date. The output should look something like this:

/data/2018-10-01/datafile.csv 
/data/2018-10-02/datafile.csv 
/data/2018-10-03/datafile.csv 
/data/2018-10-04/datafile.csv 
/data/2018-10-05/datafile.csv 

For example, the file /data/2018-10-01/datafile.csv should contain the following data:

data_field,col1,col2
10/01/2018,125,abc

What approach should I take to achieve this?

Bob

1 Answer


Look at partitionBy() in the DataFrameWriter class. Example usage would be df.write.partitionBy("date_column")...
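
A minimal sketch of that approach, assuming Spark 2.x, where the CSV writer is built in (on 1.6 you would need the spark-csv package); df is the DataFrame from the question, and the dt column name is illustrative:

from pyspark.sql import functions as F

# Derive a yyyy-MM-dd partition column from the MM/dd/yyyy strings
df_out = df.withColumn(
    "dt",
    F.from_unixtime(F.unix_timestamp("data_field", "MM/dd/yyyy"), "yyyy-MM-dd")
)

# One folder per date under /data; data_field, col1, col2 stay in the files
df_out.write.partitionBy("dt").mode("overwrite").csv("/data", header=True)

Note that partitionBy() encodes the partition value in the directory name (e.g. /data/dt=2018-10-01/) and Spark names the output files part-*, so producing exactly /data/2018-10-01/datafile.csv would take a rename step after the write.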

ShirishT
  • Would this work on Spark 1.6 without using the Databricks libraries for CSV files? – Bob Oct 14 '18 at 04:20