
I have a batch-processing transaction data transformation/validation pipeline written in a Scala Databricks notebook. When the pipeline finishes, it writes the validated data into a SQL Server database for later use.

Ongoing requirements are beginning to focus on transaction metadata: facts about transactions rather than in-situ data transformations. These facts will live in a separate table in SQL Server but retain a 1:1 relationship with transactions.

Currently, in the Spark session, a unique index ID is assigned to each transaction in the batch with this function:

  import org.apache.spark.sql.{DataFrame, Row}
  import org.apache.spark.sql.types.{LongType, StructField, StructType}

  // Appends a 0-based "index" column by zipping each row with its position in the RDD.
  def addColumnIndex(df: DataFrame): DataFrame = {
    spark.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      StructType(df.schema.fields :+ StructField("index", LongType, nullable = false)))
  }
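
For reference, it gets called once per batch, roughly like this (dfBatch is just an illustrative name for the transformed batch DataFrame):

  // dfBatch stands in for the DataFrame produced by the upstream transformations
  val dfIndexed = addColumnIndex(dfBatch)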

This helps some of our transformations, but because we're batch-processing, IDs are not unique across batches. Our strategy is to also stamp a batchID on each transaction and then use a composite key in SQL Server, a combination of batch and index. This ignores the native identity column on the table.
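
As a rough sketch of that strategy (the batchID value and column names here are illustrative, not our exact schema):

  import org.apache.spark.sql.functions.lit

  // Attach a batch identifier; together with "index" this forms the composite key
  // we currently use instead of the table's identity column.
  val batchId = java.util.UUID.randomUUID().toString  // could equally be a run or date identifier
  val dfKeyed = addColumnIndex(dfBatch).withColumn("batchID", lit(batchId))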

When writing to SQL, we use a simple

  .write
  .mode(SaveMode.Append)
  .jdbc(jdbcUrl, "Transactions", connectionProperties)
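
The metadata rows would then go to their own table keyed on the same pair. A rough sketch, where dfFacts, the fact column names, and TransactionFacts are placeholders:

  // Fact rows carry the same (batchID, index) pair so they can be joined 1:1
  // back to Transactions via the composite key rather than the identity column.
  dfFacts
    .select("batchID", "index", "factName", "factValue")
    .write
    .mode(SaveMode.Append)
    .jdbc(jdbcUrl, "TransactionFacts", connectionProperties)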

On insertion, transactions are given a unique identity ID by SQL Server. Is there a way to reliably use this ID for the relationship between the transaction and fact tables? Is the composite key approach overcomplicating this?
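
For illustration, recovering the identity after the fact would presumably mean reading the table back and joining on the composite key, something like the sketch below (the identity column name TransactionID is an assumption):

  // Read back the identity values assigned by SQL Server and re-attach them
  // to the in-memory batch via the composite key. This feels fragile and expensive.
  val persisted = spark.read
    .jdbc(jdbcUrl, "Transactions", connectionProperties)
    .select("TransactionID", "batchID", "index")
  val dfWithIdentity = dfKeyed.join(persisted, Seq("batchID", "index"))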

Basically, can we make Spark respect the unique IDs that SQL Server creates?
