I am trying to read only the latest files in a directory (say, files that arrived in the last hour) and load that data using PySpark Structured Streaming. I have tried the maxFileAge option of the file source, but it still loads all the files in the directory, regardless of the time specified in the option.
    spark.readStream \
        .option("maxFileAge", "1h") \
        .schema(cust_schema) \
        .csv(upload_path) \
        .withColumn("closing_date", get_date_udf_func(input_file_name())) \
        .writeStream.format("parquet") \
        .trigger(once=True) \
        .option("checkpointLocation", checkpoint_path) \
        .option("path", write_path) \
        .start()
Above is the code that I tried, but it loads all available files regardless of their age. Please point out what I am doing wrong here.
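To make the goal concrete, the selection being asked for is roughly the following, sketched in plain Python outside Spark. This is only an illustration under assumptions: `recent_files` is a hypothetical helper (not part of the code above or of Spark), it assumes the directory is on a local filesystem, and it uses file modification time as the notion of "new".

```python
import os
import time


def recent_files(directory, max_age_seconds=3600, now=None):
    """Return sorted paths of regular files in `directory` whose
    modification time falls within the last `max_age_seconds`.

    Hypothetical helper for illustration; `now` can be passed in
    (as a Unix timestamp) to make the cutoff reproducible.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    selected = []
    for entry in os.scandir(directory):
        # Keep only regular files modified at or after the cutoff.
        if entry.is_file() and entry.stat().st_mtime >= cutoff:
            selected.append(entry.path)
    return sorted(selected)
```

The resulting list could then be handed to a batch read, for example `spark.read.schema(cust_schema).csv(recent_files(upload_path))`, since the batch CSV reader accepts a list of paths. That is a batch workaround, not a streaming one, and it would not apply as written to HDFS or object-store paths.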