
I am trying to read the latest files (say, new files from the last one hour) available in a directory and load that data using PySpark Structured Streaming. I have tried the `maxFileAge` option of Spark streaming, but it still loads all the files in the directory, regardless of the time specified in the option.

spark.readStream \
    .option("maxFileAge", "1h") \
    .schema(cust_schema) \
    .csv(upload_path) \
    .withColumn("closing_date", get_date_udf_func(input_file_name())) \
    .writeStream.format("parquet") \
    .trigger(once=True) \
    .option("checkpointLocation", checkpoint_path) \
    .option("path", write_path) \
    .start()

Above is the code that I tried, but it loads all available files regardless of age. Please point out what I am doing wrong here.

asked by josephthomaa
  • I also tried the answer from this link, i.e. by creating a warm-up stream, but it didn't work for me: https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset – josephthomaa Mar 09 '22 at 07:33
  • They changed it so that `maxFileAge` will be ignored if `latestFirst` and `maxFilesPerTrigger` are set. You can set the `cleanSource` option to "archive" and it will move the read files to a new location, so on your next run they are no longer in your upload_path. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources – Michael Gardner May 05 '22 at 02:56
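Following the comment above, a minimal sketch of the `cleanSource` approach is shown below. It assumes you already have a `SparkSession`, a schema, and paths defined (`archive_path` here is a hypothetical directory that must live outside `upload_path`, otherwise Spark would re-discover the archived files); the streaming chain is wrapped in a function so nothing runs until you call it with your own session:

```python
def build_stream(spark, cust_schema, upload_path, archive_path,
                 checkpoint_path, write_path):
    """Sketch: read new CSV files once, then archive consumed files
    so subsequent runs of the trigger-once stream only see new data."""
    return (spark.readStream
            .schema(cust_schema)
            .option("cleanSource", "archive")          # move files after they are read
            .option("sourceArchiveDir", archive_path)  # must be outside upload_path
            .csv(upload_path)
            .writeStream
            .format("parquet")
            .trigger(once=True)
            .option("checkpointLocation", checkpoint_path)
            .option("path", write_path)
            .start())
```

Note that with `cleanSource` the "only process recent files" requirement is met by removing already-processed files from the source directory, rather than by filtering on file age, so `maxFileAge` is no longer needed.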

0 Answers