
I have been struggling for days to find a solution and so I'm hoping someone with more Algebird experience can help!

I have a stream of events I'm aggregating using Algebird, where each event represents an attempt to perform some task. Consider the following data structure to represent each attempt:

case class TaskAttempt(
    taskId: String,
    time: Int,
    `type`: String, // backticks because `type` is a reserved word in Scala
    value: Long,
    valueUnit: String
)

I am aggregating these attempts from a stream, and there is no guarantee that an attempt to perform a task will succeed. In the case that an attempt fails, I expect additional attempts for the same task. The aggregation I'm trying to build does the following:

  1. Collect only the most recent attempt (based on the TaskAttempt.time field) for each task ID. Assume larger values for TaskAttempt.time mean the event happened more recently. All previous events for each task will be ignored.
  2. Sum the TaskAttempt.value field from the TaskAttempt instances collected in step 1 into a Map(type -> Map(valueUnit -> valueSum)). This means that in the end, all values from each most recent task attempt will be summed if their type and valueUnit fields are equal.
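Outside of Algebird, the two steps above amount to a plain-Scala function like this hypothetical sketch (the function name `aggregate` is mine, and `type` is backticked because it is a Scala keyword):

```scala
// Hypothetical plain-Scala sketch of the desired semantics (no Algebird):
// keep the latest attempt per taskId, then sum values by (type, valueUnit).
case class TaskAttempt(
    taskId: String,
    time: Int,
    `type`: String,
    value: Long,
    valueUnit: String
)

def aggregate(attempts: Seq[TaskAttempt]): Map[String, Map[String, Long]] =
  attempts
    .groupBy(_.taskId)    // step 1: group all attempts per task...
    .values
    .map(_.maxBy(_.time)) // ...and keep only the most recent one
    .groupBy(_.`type`)    // step 2: nest by type, then by valueUnit
    .map { case (t, byType) =>
      t -> byType
        .groupBy(_.valueUnit)
        .map { case (u, byUnit) => u -> byUnit.map(_.value).sum }
    }
```

This is only a batch restatement of the goal; the question below is how to express it as an Algebird aggregation.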

I was hoping to accomplish the above using something like the following, but .flatMap() cannot be called on an Algebird Preparer after calling .reduce() because the latter returns a MonoidAggregator rather than a Preparer. Regardless, here is some non-working code to further show what I'd like to accomplish:

Preparer[TaskAttempt]
  // Wrap each attempt in a Set so the reduce step can union them
  .map(attempt => Set(attempt))
  // Reduce by grouping TaskAttempts by taskId, then keeping only the
  // attempt with the largest value for time for each taskId
  .reduce { (l1: Set[TaskAttempt], l2: Set[TaskAttempt]) =>
    (l1 ++ l2)
      .groupBy(_.taskId)
      .map { case (_, attempts) => attempts.maxBy(_.time) }
      .toSet
  }
  // Map the remaining filtered attempts to the required nested Map
  // (this is the .flatMap that cannot compile, because .reduce has
  // already returned a MonoidAggregator rather than a Preparer)
  .flatMap { attempt =>
    Map(attempt.`type` -> Map(attempt.valueUnit -> attempt.value))
  }
  .sum

Ultimately, I must provide the framework I'm using for the stream aggregation (an internal tool built on top of Twitter's Summingbird) with a MonoidAggregator[TaskAttempt, Map[String, Map[String, Long]], Map[String, Map[String, Long]]] that aggregates the data as described above. How can I accomplish this? Any other ideas for how I could make this work?

Josh Diaz

1 Answer


I decided that rather than attempting to dedupe, I should avoid the need to dedupe altogether. I did this by adding additional "negative" task attempts to the topic, which negate the failed "positive" task attempts that came before them in the stream. This way I can sum all of the events in the stream without any risk of double counting due to multiple attempts for a single task.
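A minimal sketch of the idea, with hypothetical names and plain Scala standing in for Algebird's nested-Map summing: each failed attempt is followed by a "negative" attempt carrying the negated value, so a plain sum of every event gives the same result as deduplicating.

```scala
// Sketch of the negation approach (names are illustrative, no Algebird).
// A failed "positive" attempt is cancelled by a "negative" attempt with
// the negated value, so summing all events never double-counts a task.
case class TaskAttempt(
    taskId: String,
    time: Int,
    `type`: String,
    value: Long,
    valueUnit: String
)

// Emitted to the topic right after observing that attempt `a` failed.
def negation(a: TaskAttempt): TaskAttempt = a.copy(value = -a.value)

// Sum every event into Map(type -> Map(valueUnit -> valueSum)); with
// Algebird this is just the monoid sum on nested Maps.
def sumAll(events: Seq[TaskAttempt]): Map[String, Map[String, Long]] =
  events
    .groupBy(_.`type`)
    .map { case (t, byType) =>
      t -> byType
        .groupBy(_.valueUnit)
        .map { case (u, byUnit) => u -> byUnit.map(_.value).sum }
    }
```

Because addition is associative and commutative, the positive/negative pairs cancel regardless of where they land in the stream, which is what makes this safe for a Summingbird-style aggregation.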

Josh Diaz