I have been struggling for days to find a solution and so I'm hoping someone with more Algebird experience can help!
I have a stream of events I'm aggregating using Algebird, where each event represents an attempt to perform some task. Consider the following data structure to represent each attempt:
case class TaskAttempt(
  taskId: String,
  time: Int,
  `type`: String, // type is a reserved word in Scala, so it needs backticks
  value: Long,
  valueUnit: String
)
I am aggregating these attempts from a stream, and there is no guarantee that an attempt to perform a task will succeed. In the case that an attempt fails, I expect additional attempts for the same task. The aggregation I'm trying to build does the following:
- Collect only the most recent attempt (based on the TaskAttempt.time field) for each task ID. Assume larger values for TaskAttempt.time mean the event happened more recently. All previous events for each task will be ignored.
- Sum the TaskAttempt.value field from the TaskAttempt instances collected in step 1 into a Map(type -> Map(valueUnit -> valueSum)). This means that in the end, all values from each most recent task attempt will be summed if their type and valueUnit fields are equal.
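To make the expected result concrete, here is a minimal sketch of the same two steps over an in-memory Seq, using only plain Scala collections rather than Algebird. The object name ExpectedAggregation, its helper names, and the sample values ("download", "bytes", and so on) are made up for illustration. It only shows the output I'm after, not how the streaming aggregation should be built.

```scala
// Plain Scala collections only; no Algebird involved.
final case class TaskAttempt(
    taskId: String,
    time: Int,
    `type`: String,
    value: Long,
    valueUnit: String
)

// Hypothetical helper object, named for illustration only.
object ExpectedAggregation {
  // Step 1: keep only the attempt with the largest time for each taskId.
  def latestPerTask(attempts: Seq[TaskAttempt]): Seq[TaskAttempt] =
    attempts.groupBy(_.taskId).values.map(_.maxBy(_.time)).toSeq

  // Step 2: sum value per type, and within each type per valueUnit.
  def sumByTypeAndUnit(attempts: Seq[TaskAttempt]): Map[String, Map[String, Long]] =
    attempts.groupBy(_.`type`).map { case (tpe, sameType) =>
      tpe -> sameType.groupBy(_.valueUnit).map { case (unit, sameUnit) =>
        unit -> sameUnit.map(_.value).sum
      }
    }

  def aggregate(attempts: Seq[TaskAttempt]): Map[String, Map[String, Long]] =
    sumByTypeAndUnit(latestPerTask(attempts))

  def main(args: Array[String]): Unit = {
    // Made-up sample data.
    val attempts = Seq(
      TaskAttempt("t1", 1, "download", 10L, "bytes"),
      TaskAttempt("t1", 2, "download", 25L, "bytes"), // supersedes the attempt at time 1
      TaskAttempt("t2", 1, "download", 5L, "bytes"),
      TaskAttempt("t3", 4, "upload", 7L, "ms")
    )
    // Equal to Map("download" -> Map("bytes" -> 30L), "upload" -> Map("ms" -> 7L))
    println(aggregate(attempts))
  }
}
```

The earlier attempt for t1 (value 10) is dropped, so only 25 and 5 contribute to the download/bytes total.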
I was hoping to accomplish the above using something like the following, but .flatMap() cannot be called on an Algebird Preparer after calling .reduce() because the latter returns a MonoidAggregator rather than a Preparer. Regardless, here is some non-working code to further show what I'd like to accomplish:
Preparer[TaskAttempt]
  // Aggregate attempts into Sets
  .flatMap { attempt =>
    for {
      a <- attempt
    } yield Set(a)
  }
  // Reduce by grouping TaskAttempts by taskId and then collecting the
  // attempt with the largest value for time for each taskId
  .reduce {
    (
      l1: Set[TaskAttempt],
      l2: Set[TaskAttempt]
    ) =>
      (l1 ++ l2)
        .groupBy(_.taskId)
        .map { case (_, attempts) => attempts.maxBy(_.time) }
        .toSet
  }
  // Map the remaining filtered attempts to the required Map
  .flatMap { attempt =>
    for {
      value <- attempt.value
    } yield Map(
      attempt.`type` -> Map(attempt.valueUnit -> value)
    )
  }
  .sum
Ultimately, I must provide the framework I'm using for the stream aggregation (an internal tool built on top of Twitter's Summingbird) with a MonoidAggregator[TaskAttempt, Map[String, Map[String, Long]], Map[String, Map[String, Long]]]
that aggregates the data as described above. How can I accomplish this? Any other ideas for how I could make this work?