How to parallelize a REST API crawler in http4s & fs2?

Question

I wrote a sequential REST API crawler in http4s & fs2 here:

https://gist.github.com/NicolasRouquette/656ed7a2d6984ce0995fd78a3aec2566

The crawler queries a REST API service for a starting set of IDs, fetches the elements for a batch of IDs, and continues with the cross-reference IDs found in those elements until there are no new IDs to fetch. It then returns a map of all the elements fetched.

This works; however, the performance is inadequate -- too slow!

Since I don't have access to the server, I experimented with batch sizes of 10, 50, 100, 200 and 500, and even with batching all IDs in a single query. Query time increases significantly with batch size. At the largest sizes (500 and all IDs), I even got HTTP 500 responses from the server.

I would like to experiment with batching parallel queries in a load-balancing fashion using a pool of threads; however, it is unclear to me how to do this based on the fs2 docs.

Can someone suggest how to achieve this?

Regarding the choice of http4s & fs2: I found these libraries fairly easy to use for simple client-side programming. Given their emphasis on supporting tasks, streams, etc., I presume that batching parallel queries should be doable somehow.

score 1 · Answer 1 · answered Sep 15 '17 at 11:56

fs2.concurrent.join will allow you to run multiple streams concurrently. The specific section in the guide is available at https://github.com/functional-streams-for-scala/fs2/blob/v0.9.7/docs/guide.md#concurrency

For your use case, you could take your queue of IDs, chunk it into batches, create an HTTP task for each batch and wrap each task in a stream. You would then run this stream of streams concurrently with join and combine the results.

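import fs2.{Strategy, Stream, Task}

// join needs an implicit Strategy in scope; the pool size here is an arbitrary example.
// batchSize, maxOpen and the ID/Element/ElementMap types are assumed to come from your crawler.
implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(8, threadName = "crawler")

// Issues one HTTP request for a batch of IDs.
def createHttpRequest(ids: Seq[ID]): Task[(ElementMap, Set[ID])] = ???

def fetch(queue: Set[ID]): Task[(ElementMap, Set[ID])] = {
  // One single-element stream per batch of IDs.
  val resultStreams: Stream[Task, Stream[Task, (ElementMap, Set[ID])]] =
    Stream.emits(queue.toSeq)
      .vectorChunkN(batchSize)
      .map(ids => Stream.eval(createHttpRequest(ids)))

  // Run at most maxOpen requests at a time, then merge the elements and cross-reference IDs.
  val resultStream = fs2.concurrent.join(maxOpen)(resultStreams)
  resultStream.runFold((Map.empty[ID, Element], Set.empty[ID])) {
    case ((elements, ids), (newElements, newIds)) => (elements ++ newElements, ids ++ newIds)
  }
}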
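
To show how this could plug into the crawl loop described in the question (fetch, collect the cross-reference IDs, repeat until nothing new turns up), here is a minimal sketch assuming fs2 0.9.x. The CrawlSketch object, the in-memory graph that stands in for the server, the Int/String types and the pool size are all hypothetical; only the shape of the loop is the point.

```scala
import fs2.{Strategy, Stream, Task}

object CrawlSketch {
  // Hypothetical types; the real crawler defines its own.
  type ID = Int
  type Element = String
  type ElementMap = Map[ID, Element]

  // Thread pool used by join; the size is an arbitrary example.
  implicit val strategy: Strategy =
    Strategy.fromFixedDaemonPool(4, threadName = "crawler")

  // Hypothetical stand-in for the server: ID -> (element, cross-reference IDs).
  val graph: Map[ID, (Element, Set[ID])] = Map(
    1 -> ("a", Set(2, 3)),
    2 -> ("b", Set(3)),
    3 -> ("c", Set(1, 4)),
    4 -> ("d", Set.empty[ID])
  )

  // Stand-in for the real HTTP call: looks the batch up in the in-memory graph.
  def createHttpRequest(ids: Seq[ID]): Task[(ElementMap, Set[ID])] =
    Task.delay {
      val found = ids.flatMap(id => graph.get(id).map(entry => id -> entry))
      val elements = found.map { case (id, (element, _)) => id -> element }.toMap
      val refs = found.flatMap { case (_, (_, r)) => r }.toSet
      (elements, refs)
    }

  // Same idea as the fetch above: batch the IDs, run up to maxOpen requests at once.
  def fetch(queue: Set[ID], batchSize: Int, maxOpen: Int): Task[(ElementMap, Set[ID])] = {
    val requests: Stream[Task, Stream[Task, (ElementMap, Set[ID])]] =
      Stream.emits(queue.toSeq)
        .vectorChunkN(batchSize)
        .map(batch => Stream.eval(createHttpRequest(batch)))
    fs2.concurrent.join(maxOpen)(requests)
      .runFold((Map.empty[ID, Element], Set.empty[ID])) {
        case ((elements, refs), (e, r)) => (elements ++ e, refs ++ r)
      }
  }

  // Keep fetching the frontier until no unseen IDs remain.
  def crawl(start: Set[ID], batchSize: Int, maxOpen: Int): Task[ElementMap] = {
    def loop(frontier: Set[ID], acc: ElementMap): Task[ElementMap] =
      if (frontier.isEmpty) Task.now(acc)
      else fetch(frontier, batchSize, maxOpen).flatMap { case (elements, refs) =>
        val all = acc ++ elements
        // Drop IDs already fetched or already requested in this round.
        loop(refs -- all.keySet -- frontier, all)
      }
    loop(start, Map.empty[ID, Element])
  }
}
```

Calling CrawlSketch.crawl(Set(1), batchSize = 2, maxOpen = 2).unsafeRun() walks the toy graph in three rounds and should return all four elements. Against a real server you would tune batchSize and maxOpen together, since the question reports that large batches are slow and can trigger HTTP 500 responses.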
