
Is this pattern of batching a subset of a collection for parallel processing OK? Is there a better way to do this that I am missing?

When given a collection of entity IDs that need to be fetched from a service that returns a Scala Future, we batch the requests instead of making them all at once, because the service can only handle a certain number of requests at a time. In a way it is a primitive throttling mechanism to avoid overwhelming the data store, but it looks like a code smell.


import scala.collection.generic.CanBuildFrom
import scala.concurrent.{ExecutionContext, Future}

object FutureHelper {
  // Fold over the collection so that each dbFetch is only started
  // once the Future for the previous element has completed.
  def batchSerially[A, B, M[a] <: TraversableOnce[a]](l: M[A])(dbFetch: A => Future[B])(
      implicit ctx: ExecutionContext, buildFrom: CanBuildFrom[M[A], B, M[B]]): Future[M[B]] =
    l.foldLeft(Future.successful(buildFrom(l))) {
      case (accF, curr) =>
        for {
          acc <- accF
          b   <- dbFetch(curr)
        } yield acc += b
    }.map(_.result())
}

object FutureBatching extends App {
  implicit val e: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global

  val entityIds = List(1, 2, 3, 4, 5, 6)
  val batchSize = 2

  val listOfFetchedResults =
    FutureHelper.batchSerially(entityIds.grouped(batchSize)) { groupedByBatchSize =>
      Future.sequence {
        groupedByBatchSize.map(i => Future.successful(i))
      }
    }.map(_.flatten.toList)
}

gJohn

1 Answer


I believe that by default a scala.concurrent.Future starts executing as soon as it is created, so the invocations of dbFetch() will kick off the connections right away. Since the foldLeft transforms all of the suspended A => Future[B] functions into actual Future objects, I don't believe the batching will happen the way you want.
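To make the eagerness claim concrete, here is a minimal, illustrative sketch (the counter is only there to observe the side effect): a Future's body is submitted to the ExecutionContext at construction time, not when it is awaited.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerFutureDemo extends App {
  val counter = new AtomicInteger(0)

  // The block below is scheduled immediately; there is no explicit "start".
  val f = Future { counter.incrementAndGet() }

  Await.ready(f, 1.second)
  println(counter.get()) // the side effect has run exactly once
}
```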

Update: yes, I believe that code works correctly (see the comments below).

Another way is to let the pool define the level of parallelism, but that doesn't always work, depending on your execution environment.
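For example, a minimal sketch of letting the pool cap concurrency (the names and the stand-in dbFetch are hypothetical): back the implicit ExecutionContext with a fixed-size thread pool, so that even though all the Futures are created eagerly, at most N of them execute at a time.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolLimitedDemo extends App {
  // Cap parallelism at 2 by giving the Futures a dedicated
  // two-thread pool instead of the global pool.
  implicit val twoAtATime: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

  // Stand-in for the real service call.
  def dbFetch(id: Int): Future[Int] = Future(id * 10)

  // All six Futures are created up front, but they queue on the
  // two-thread pool, so only two fetches run concurrently.
  val results: Future[List[Int]] =
    Future.traverse(List(1, 2, 3, 4, 5, 6))(dbFetch)

  println(Await.result(results, 5.seconds))
}
```

The trade-off, as noted above, is that this only works when you control the ExecutionContext the fetches run on.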

I've had some success doing batching with the parallel collections. If you create a collection where the number of elements represents the desired number of concurrent activities, you can use .par. For instance:

import scala.collection.immutable.VectorBuilder
import scala.collection.parallel.ParSeq

// Partition xs into numBatches sub-sets, and invoke processBatch on each Set in parallel.
def batch[A, B](xs: Iterable[A], numBatches: Int)
    (processBatch: Set[A] => Set[B]): ParSeq[B] =
  split(xs, numBatches).par.flatMap(processBatch)

// Split the input iterable into numBatches sub-sets, round-robin.
// For example split(Seq(1,2,3,4,5,6), 3) = Seq(Set(1, 4), Set(2, 5), Set(3, 6))
def split[A](xs: Iterable[A], numBatches: Int): Seq[Set[A]] = {
  val buffers: Vector[VectorBuilder[A]] = Vector.fill(numBatches)(new VectorBuilder[A]())
  val elems = xs.toIndexedSeq
  for (i <- elems.indices)
    buffers(i % numBatches) += elems(i)
  buffers.map(_.result().toSet)
}
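A self-contained usage sketch of the same idea (fetchBatch is a hypothetical stand-in for the per-batch service call): with three batches and .par, at most three invocations of fetchBatch run concurrently.

```scala
import scala.collection.parallel.ParSeq

object ParBatchDemo extends App {
  // Hypothetical stand-in for a blocking per-batch service call.
  def fetchBatch(ids: Set[Int]): Set[String] = ids.map(id => s"entity-$id")

  // Three round-robin batches, processed in parallel via .par.
  val batches: Seq[Set[Int]] = Seq(Set(1, 4), Set(2, 5), Set(3, 6))
  val fetched: ParSeq[String] = batches.par.flatMap(fetchBatch)

  println(fetched.toList.sorted)
}
```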

shj
  • Thanks for your answer! I think your solution is a good alternative; however, the invocations of dbFetch in my original code happen inside the for comprehension, so wouldn't that prevent the Futures from executing immediately? If the Futures were created outside the for comprehension, they would run in parallel. This post gives a good example: https://stackoverflow.com/questions/19045936/scalas-for-comprehension-with-futures – gJohn Apr 03 '19 at 20:12