
I have a working solution, but I'm looking for safer and better ways of doing this.

Every time the job starts up, it looks up a custom checkpoint that indicates the date from which processing should start. From the source dataframe I create a new one that starts at that date, based on the checkpoint. The solution currently limits the number of rows of the dataframe that have to be processed:

val readFormat = "delta"
val sparkRead = spark.read.format(readFormat)

val fileFormat = if (readFormat == "delta") "" else "." + readFormat
val testData = sparkRead
                  .load(basePath + "/testData/table_name" + fileFormat)
                  .where(!((col("size") < 1)))
                  .where($"modified" >= start)
                  .limit(5000)
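
The checkpoint lookup itself isn't shown here; it is essentially the inverse of the saveTimestamp call further down. A simplified sketch of how start is derived (readTimestampOption is a placeholder for the actual read from storage):

import java.sql.Timestamp

// Hypothetical lookup of the checkpoint persisted by saveTimestamp below;
// falls back to the epoch when no checkpoint has been written yet.
val start: Timestamp = new Timestamp(readTimestampOption().getOrElse(0L))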

For each identifier I download a file from Azure Storage and save its content in a new column of the dataframe:

val tryDownload = testData
  .withColumn("fileStringPreview", downloadUDF($"id"))
  .withColumn(
    "status",
    when(
      $"fileStringPreview".startsWith("failed:") ||
        $"fileStringPreview".startsWith("emptyUrl"),
      lit("failed")
    ).otherwise(lit("succeeded")))

When this is done, the checkpoint is updated with the latest modified date of the elements processed in this iteration.

def saveLatest(saved_df: DataFrame, timeSeriesColName: String): Unit = {
  // take the maximum value of the time-series column from this batch
  val latestTime = saved_df.agg(max(timeSeriesColName)).collect()(0)
  try {
    val timespanEnd = latestTime.getTimestamp(0).toInstant().toEpochMilli()
    saveTimestamp(timespanEnd) // this function actually stores the data
  } catch {
    case e: java.lang.NullPointerException =>
      LoggingWrapper.log("timespanEnd is null")
  }
}

saveLatest(tryDownload, "modified")  

I'm worried about the limit(5000) approach. Is there a better way that keeps good performance while downloading the specified number of files in each iteration?
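
In particular, since limit without an orderBy can return an arbitrary subset of the matching rows, I wonder whether I should order the batch first so the checkpoint can't skip rows, along these lines:

// Possible variant: take the 5000 oldest unprocessed rows instead of an arbitrary subset.
val testData = sparkRead
  .load(basePath + "/testData/table_name" + fileFormat)
  .where(col("size") >= 1)
  .where($"modified" >= start)
  .orderBy($"modified".asc)
  .limit(5000)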

Thank you for the suggestions in advance! :)
