
I am trying to read the latest files (say, new files from the last one hour) available in a directory and load that data using PySpark Structured Streaming. I have tried the `maxFileAge` option of Spark streaming, but it still loads all the files in the directory, regardless of the time specified in the option.

spark.readStream \
    .option("maxFileAge", "1h") \
    .schema(cust_schema) \
    .csv(upload_path) \
    .withColumn("closing_date", get_date_udf_func(input_file_name())) \
    .writeStream.format("parquet") \
    .trigger(once=True) \
    .option("checkpointLocation", checkpoint_path) \
    .option("path", write_path) \
    .start()

Above is the code that I tried, but it loads all available files regardless of age. Please point out what I am doing wrong here.

asked by josephthomaa
  • I also tried the answer from this link, i.e. by creating a warm-up stream, but it didn't work for me: https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset – josephthomaa Mar 09 '22 at 07:33
  • They changed it so that `maxFileAge` will be ignored if `latestFirst` and `maxFilesPerTrigger` are set. You can set the `cleanSource` option to "archive" and it will move the read files to a new location, so on your next run they are no longer in your upload_path. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources – Michael Gardner May 05 '22 at 02:56
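Following the comment above, a minimal sketch of the `cleanSource` approach is shown below. It assumes you already have a `SparkSession`, a schema, and paths defined (`archive_path` here is a hypothetical directory that must live outside `upload_path`, otherwise Spark would re-discover the archived files); the streaming chain is wrapped in a function so nothing runs until you call it with your own session:

```python
def build_stream(spark, cust_schema, upload_path, archive_path,
                 checkpoint_path, write_path):
    """Sketch: read new CSV files once, then archive consumed files
    so subsequent runs of the trigger-once stream only see new data."""
    return (spark.readStream
            .schema(cust_schema)
            .option("cleanSource", "archive")          # move files after they are read
            .option("sourceArchiveDir", archive_path)  # must be outside upload_path
            .csv(upload_path)
            .writeStream
            .format("parquet")
            .trigger(once=True)
            .option("checkpointLocation", checkpoint_path)
            .option("path", write_path)
            .start())
```

Note that with `cleanSource` the "only process recent files" requirement is met by removing already-processed files from the source directory, rather than by filtering on file age, so `maxFileAge` is no longer needed.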

0 Answers