I have a dataset with a millisecond timestamp field, e.g. `create_time` of `bigint` type, and I want to write it into different HDFS directories with a formatted partition suffix like `xxx/pt=20220101`, `xxx/pt=20220102`, and so on. The `partitionBy` method of `DataFrameWriter` looks like the right tool for this, but it cannot take a UDF (one that formats the bigint milliseconds into a `yyyyMMdd` date) as input, because the resulting expression is not present in the schema of the child plan.
For example,

```scala
spark.sql("select df1.*, df2.* from df1 join df2 on df1.key1 = df2.key2")
  .write
  .option("header", value = true)
  .mode(SaveMode.Overwrite)
  .partitionBy("my_udf(df2.key2)")
  .csv("./test/data")
```
This fails to run because Spark cannot resolve the type of that expression from `df1.*` and `df2.*`.
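To be clear about what the UDF itself would do: the formatting step is simple on its own. A minimal sketch in plain Scala (the helper name `toPartition` is hypothetical, and UTC is assumed for the time zone):

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper: turn an epoch-milliseconds bigint (Long)
// into a yyyyMMdd partition value, assuming UTC.
def toPartition(createTimeMs: Long): String =
  Instant.ofEpochMilli(createTimeMs)
    .atZone(ZoneOffset.UTC)
    .format(DateTimeFormatter.ofPattern("yyyyMMdd"))
```

The hard part is not the formatting but getting `partitionBy` to accept the computed value when it is not a column of the plan.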
Is there any way to solve this problem?