
I have a Spark job that is composed as follows:

1- Read a static DataFrame from Delta Lake.

2- Read a streaming DataFrame from Delta Lake.

3- Join the stream with the static DataFrame.

4- Do a flatMapGroupsWithState.

5- Write the output (a minimal sketch of this shape follows the list).
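
A minimal sketch of that shape, assuming an existing SparkSession named spark, with placeholder Delta paths and a placeholder join key (step 4 stands in for the flatMapGroupsWithState code shown further down):

val static = spark.read.format("delta").load("/data/static")        // 1- static DataFrame
val stream = spark.readStream.format("delta").load("/data/stream")  // 2- streaming DataFrame
val joined = stream.join(static, Seq("id"))                         // 3- stream-static join
// 4- groupByKey + flatMapGroupsWithState (see the actual code below)
joined.writeStream                                                  // 5- write the output
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start("/data/output")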

The problem is that the output differs from what I expect, as if events were lost in flatMapGroupsWithState. Not only that, but the output is random: when I re-run with the same input, I get a different output.

But when I added .coalesce(1) to the write operation, I always got the desired output in local mode, though not in cluster mode.

This is the code I am using:

val entityScheduleSlots = data
  .withColumn("products", concat(col("batteries"), col("photovoltaics")))
  .drop("photovoltaics", "batteries", "labels")
  // stream-static join: keep schedules whose delivery point belongs to the entity
  .join(
    entities,
    array_contains(entities("entity_delivery_points"), col("delivery_point_id")))
  .withColumn("now", current_timestamp())
  // watermark on a processing-time column, used by EventTimeTimeout below
  .withWatermark("now", "5 minutes")
  .as(Encoders.product[enrichedDeliveryPointSchedule])
  // one state per (timestamp, entity, schedule) combination
  .groupByKey(e => e.timestamp.toString + e.entity_id.toString + e.schedule_id)(
    Encoders.STRING)
  .flatMapGroupsWithState(
    outputMode = OutputMode.Append,
    timeoutConf = GroupStateTimeout.EventTimeTimeout)(
    Function.computeExplodedEntityScheduleSlots)(
    Encoders.kryo[Function.State],
    Encoders.product[EntityScheduleSlot])

entityScheduleSlots is my output, and I did my tests in local mode.
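
For reference, the write with coalesce(1) mentioned above looks roughly like this (the sink format and paths are placeholders, not my real sink):

entityScheduleSlots
  .coalesce(1) // force a single partition before the sink
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start("/data/output")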

object Function {
  import scala.collection.mutable

  // Running aggregate for a single product.
  case class ProductState(
      var count: Int,
      var quantity: Double,
      var price: Double,
      sellable: Boolean)
  // Per-group state: delivery points seen so far plus the per-product aggregates.
  case class State(var delivery_points_count: Int, var products: mutable.Map[Long, ProductState])
  // Referenced from the streaming query above, so it cannot be private.
  def computeExplodedEntityScheduleSlots(
      uid: String,
      ss: Iterator[enrichedDeliveryPointSchedule],
      state: GroupState[State]): Iterator[EntityScheduleSlot] = {
    // The group timed out before completing: drop its state and emit nothing.
    if (state.hasTimedOut) {
      state.remove()
      return Iterator.empty
    }
    val schedules = ss.toList
    // Restore the partial aggregate for this group, or start fresh.
    val newState: State =
      state.getOption.getOrElse(State(0, mutable.Map()))
    schedules.foreach(s => {
      // Each incoming schedule corresponds to one delivery point of the entity.
      newState.delivery_points_count += 1
      val qualificationsProductsIDs =
        if (s.entity_qualifications != null) s.entity_qualifications.map(q => q.product)
        else List()
      if (s.products != null) {
        s.products.foreach(p => {
          if (qualificationsProductsIDs.contains(p.product)) {
            val productState =
              newState.products.getOrElse(p.product, ProductState(0, 0.0, 0.0, p.sellable))
            // Weight the new price by its quantity relative to the average
            // quantity seen so far for this product.
            val factor =
              if (productState.count == 0) 1
              else p.quantity / (productState.quantity / productState.count)
            productState.quantity += p.quantity
            productState.price =
              (productState.price * productState.count + p.price * factor) / (productState.count + 1)
            productState.count += 1
            newState.products.update(p.product, productState)
          }
        })
      }
    })
    // Emit the aggregated slot once every delivery point of the entity has been
    // seen, then clear the state for this group.
    if (newState.delivery_points_count == schedules.head.entity_delivery_points.length) {
      state.remove()
      return Iterator(
        EntityScheduleSlot(
          timestamp = schedules.head.timestamp,
          entity = schedules.head.entity_id,
          schedule_timestamp = schedules.head.schedule_timestamp,
          schedule_id = schedules.head.schedule_id,
          products =
            if (schedules.head.entity_qualifications != null)
              schedules.head.entity_qualifications
                .map(q => {
                  val product =
                    newState.products.getOrElse(q.product, ProductState(0, 0.0, 0.0, false))
                  EntityScheduleSlotProduct(
                    q.product,
                    product.quantity,
                    product.price,
                    product.sellable)
                })
            else List()))
    }
    state.update(newState)
    // Schedule an event-time timeout 2 minutes past the current watermark,
    // falling back to processing time while the watermark is still 0.
    val currentWatermarkMs =
      if (state.getCurrentWatermarkMs() > 0) state.getCurrentWatermarkMs()
      else System.currentTimeMillis()
    state.setTimeoutTimestamp(currentWatermarkMs, "2 minutes")
    Iterator.empty
  }
}

case class enrichedDeliveryPointSchedule(
    timestamp: java.sql.Timestamp,
    schedule_timestamp: java.sql.Timestamp,
    schedule_id: String,
    delivery_point_id: Long,
    products: List[DeliveryPointScheduleSlotProduct],
    entity_id: Long,
    entity_delivery_points: List[Long],
    entity_qualifications: List[EntityQualification])

Thank you in advance.


1 Answer


You have provided little information, so it is difficult to understand the issue, but I can give you some tips:

flatMapGroupsWithState is a function you use in stateful Structured Streaming to store a partial result in Spark's internal state:

def flatMapGroupsWithState[S: Encoder, U: Encoder](
    outputMode: OutputMode,
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]

Among the parameters, in addition to the initial state, there is the state update function:

func: (K, Iterator[V], GroupState[S]) => Iterator[U]
  • It could simply be that you are aggregating the result and consequently getting a different output.

  • Another problem could be that you are checking the output from a single executor and, given the distributed nature of the Spark framework, receiving only partial output.
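
As an illustration, here is a minimal, self-contained sketch of stateful counting with flatMapGroupsWithState; all names and types are hypothetical, not taken from your job:

import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Count events per key across micro-batches; the state is a running Int total.
def countPerKey(
    key: String,
    events: Iterator[String],
    state: GroupState[Int]): Iterator[(String, Int)] = {
  val total = state.getOption.getOrElse(0) + events.size
  state.update(total) // store the partial result in Spark's internal state
  Iterator((key, total))
}

// events: Dataset[String], e.g. obtained from readStream
def counts(events: Dataset[String]): Dataset[(String, Int)] =
  events
    .groupByKey(e => e)(Encoders.STRING)
    .flatMapGroupsWithState(
      OutputMode.Update, GroupStateTimeout.NoTimeout)(countPerKey)(
      Encoders.scalaInt, Encoders.tuple(Encoders.STRING, Encoders.scalaInt))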

In the future, add code to your question for a better understanding of the problem.

  • Thank you for your fast response. I edited my question, adding the code and the mode. The production environment is Spark in cluster mode, and I got the same result in cluster mode and local mode. Also, we have ~40 jobs using `flatMapGroupsWithState`, and this is the first one with this problem. Thank you so much! – dhia Gharsallaoui Jan 23 '23 at 15:11