I have a working solution, but I'm looking for ways to do this more safely and cleanly.
Every time the job starts up, it looks up a custom checkpoint that indicates the date from which processing should start. From the source dataframe I derive one that starts at that date, and the current solution then limits the number of rows to be processed:
val readFormat = "delta"
val sparkRead = spark.read.format(readFormat)
val fileFormat = if (readFormat == "delta") "" else "." + readFormat

val testData = sparkRead
  .load(basePath + "/testData/table_name" + fileFormat)
  .where(!(col("size") < 1))
  .where($"modified" >= start)
  .limit(5000)
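For context, start comes from the checkpoint lookup mentioned above. A minimal sketch of how that could look, assuming the checkpoint is kept as a single epoch-millisecond value in a small Delta table (the path and column name here are hypothetical):

import java.sql.Timestamp
import org.apache.spark.sql.functions.max

// Hypothetical checkpoint table with a single "timespanEnd" column holding epoch millis.
val checkpointPath = basePath + "/testData/checkpoint"

val start: Timestamp = spark.read.format("delta")
  .load(checkpointPath)
  .agg(max("timespanEnd"))
  .collect()(0) match {
    case row if !row.isNullAt(0) => new Timestamp(row.getLong(0))
    case _                       => new Timestamp(0L) // no checkpoint yet: start from the epoch
  }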
For each identifier I download a file from Azure Storage and save its content in a new column of the dataframe:
val tryDownload = testData
  .withColumn("fileStringPreview", downloadUDF($"id"))
  .withColumn(
    "status",
    when(
      $"fileStringPreview".startsWith("failed:") ||
        $"fileStringPreview".startsWith("emptyUrl"),
      lit("failed")
    ).otherwise(lit("succeeded")))
When this is done, the checkpoint is updated with the latest modified date among the elements processed in this iteration.
def saveLatest(saved_df: DataFrame, timeSeriesColName: String): Unit = {
  val latestTime = saved_df.agg(max(timeSeriesColName)).collect()(0)
  try {
    val timespanEnd = latestTime.getTimestamp(0).toInstant().toEpochMilli()
    saveTimestamp(timespanEnd) // this function actually stores the data
  } catch {
    case e: java.lang.NullPointerException =>
      LoggingWrapper.log("timespanEnd is null")
  }
}
saveLatest(tryDownload, "modified")
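saveTimestamp just persists the epoch-millisecond value so the next run can pick it up again. Something along these lines, again with a hypothetical checkpoint location matching the read sketch above:

import spark.implicits._

// Hypothetical checkpoint writer: overwrite a one-row Delta table with the new watermark.
def saveTimestamp(timespanEnd: Long): Unit = {
  Seq(timespanEnd)
    .toDF("timespanEnd")
    .write
    .format("delta")
    .mode("overwrite")
    .save(basePath + "/testData/checkpoint")
}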
I'm worried about the limit(5000) approach: is there a better way that still performs well while downloading only the specified number of files in each iteration?
Thank you for the suggestions in advance! :)