
I have a dataset with a millisecond timestamp field, e.g. create_time of type bigint, and I want to write the records into different HDFS directories with a formatted partition suffix like xxx/pt=20220101, xxx/pt=20220102, and so on. The partitionBy method of DataFrameWriter looks like the right tool for this, but it cannot take a UDF (formatting the bigint milliseconds into a yyyyMMdd date) as its input, because the resulting expression is not present in the schema of the child plan. For example,

spark.sql("select df1.*,df2.* from df1 join df2 on df1.key1 = df2.key2")
    .write
    .option("header", value = true)
    .mode(SaveMode.Overwrite)
    .partitionBy("my_udf(df2.key2)")
    .csv("./test/data")

This fails to run because partitionBy accepts only the names of columns that already exist in the output schema, so Spark cannot resolve the expression my_udf(df2.key2) against df1.* and df2.*. Is there any way to solve this problem?

  • Use df.withColumn("name col to partition", your_udf(...)) and then partitionBy("name col to partition"); see the sketch after these comments. – mvasyliv Aug 02 '22 at 07:35
  • I don't want to write the partitionBy column into my HDFS files. – doki Aug 02 '22 at 08:50
  • 1
    Spark partitionBy() is a function of pyspark.sql.DataFrameWriter class which is used to partition based on one or multiple column values while writing DataFrame to Disk/File system. When you write Spark DataFrame to disk by calling partitionBy(), PySpark splits the records based on the partition column and stores each partition data into a sub-directory. – mvasyliv Aug 02 '22 at 09:42
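
A minimal sketch of the withColumn approach suggested in the comments, in Scala (the pt column name is illustrative, and spark is the usual SparkSession; note that from_unixtime expects seconds, so the millisecond value is divided by 1000, and that yyyyMMdd, not YYYYMMDD, is the calendar-date pattern):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, from_unixtime}

// Derive the partition value from the bigint millisecond timestamp,
// then partition on the derived column instead of a UDF expression.
val joined = spark.sql("select df1.*, df2.* from df1 join df2 on df1.key1 = df2.key2")

joined
    .withColumn("pt", from_unixtime(col("create_time") / 1000, "yyyyMMdd"))
    .write
    .option("header", value = true)
    .mode(SaveMode.Overwrite)
    .partitionBy("pt")
    .csv("./test/data")

Note that partitionBy encodes the partition column into the directory path (./test/data/pt=20220101/...) and drops it from the data files themselves, so the extra column does not end up inside the CSV files.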

0 Answers