
I want to publish a long list of events into Kafka by consuming an fs2.Stream that corresponds to a very big list of DB rows, which would eventually cause an Out Of Memory error if compiled to a List.

So, let's say that I have a very big list of UUID keys, with millions of records:

def getKeyStream(timeRangeEnd: LocalDateTime): fs2.Stream[doobie.ConnectionIO, UUID]

and that I want to publish one event into Kafka for each chunk of 500 keys, using this publisher:

trait KeyPublisher {
  def publish(event: ChunkOfKeys): IO[Long]
}

I would like to create a function to enqueue/publish this stream into Kafka:

def enqueueKeyStreamIntoKafka(endDateTime: LocalDateTime): IO[Unit] = {
  getKeyStream(endDateTime)
     .chunkN(500)
     .evalMap(myChunk => ?????)
     ...
}

How can I consume the stream coming from the DB, split it into chunks of constant size, and then publish each of them into Kafka?

Apparently it's quite hard to find good documentation or examples on this topic. Could you please point me in the right direction?

sentenza
  • Not sure what the question is. Do you know how to publish a single element into Kafka? Do you know how to publish a batch of elements into Kafka? If so, then just do that in that `evalMap`; if not, then everything in this question is unnecessary. – Luis Miguel Mejía Suárez Jan 17 '21 at 14:51
  • Normally, I would compile the stream coming from the DB to map over it as a List (in memory) and then I would use `publish()` to push one or more elements to Kafka. So, I know how to do that. Now I need to use FS2 streams and I came here to gather information or even an example of what can be done. – sentenza Jan 17 '21 at 15:49
  • It can; it's literally a one-liner, which is why @LuisMiguelMejíaSuárez asked what the problem is. It's something like `valuesFromDB.evalMap(publishSingleEvent).compile.drain` or `valuesFromDB.through(publishEventSink).compile.drain` depending on the exact implementations and signatures (see the sketch after these comments). – Mateusz Kubuszok Jan 17 '21 at 15:49
  • So you can convert a chunk into a list, publish that list to Kafka inside the `evalMap`, and finally `compile.drain`; that gives you an **IO[Unit]** that sends the data in batches to Kafka, and you would be done. – Luis Miguel Mejía Suárez Jan 17 '21 at 16:28
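
For reference, here is a minimal sketch of the per-element approach described in the comments above; `valuesFromDB` and `publishSingleEvent` are placeholder names used for illustration, not part of the question:

import cats.effect.IO
import fs2.Stream
import java.util.UUID

// Hypothetical single-key publisher; substitute your real Kafka call.
def publishSingleEvent(key: UUID): IO[Long] = ???

def publishAllKeys(valuesFromDB: Stream[IO, UUID]): IO[Unit] =
  valuesFromDB
    .evalMap(publishSingleEvent) // run the publishing effect for each key
    .compile
    .drain                       // discard the Long results, keep only the effect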

1 Answer


Since you don't say what type ChunkOfKeys is, I'm going to assume it's something like Chunk[UUID]:

import cats.effect.IO
import doobie.Transactor
import doobie.implicits._
import java.time.LocalDateTime

def enqueueKeyStreamIntoKafka(endDateTime: LocalDateTime)(
    xa: Transactor[IO],
    publisher: KeyPublisher
): IO[Unit] =
  getKeyStream(endDateTime)
    .transact(xa) // Convert the ConnectionIO stream to Stream[IO, UUID]
    .chunkN(500)  // into Stream[IO, Chunk[UUID]]
    .evalMap(publisher.publish)  // Into Stream[IO, Long]
    .compile
    .drain // An IO[Unit] that describes the whole process
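
For completeness, a minimal sketch of how the resulting IO might be wired up and run; the `EnqueueKeysApp` name, the cats-effect 3 `IOApp.Simple` entry point, and the placeholder `Transactor`/`KeyPublisher` instances are assumptions, not part of the question:

import cats.effect.{IO, IOApp}
import doobie.Transactor
import java.time.LocalDateTime

object EnqueueKeysApp extends IOApp.Simple {
  // Placeholders: supply a real Transactor (e.g. a HikariTransactor built from
  // your DB config) and a Kafka-backed KeyPublisher implementation here.
  val xa: Transactor[IO]      = ???
  val publisher: KeyPublisher = ???

  def run: IO[Unit] =
    enqueueKeyStreamIntoKafka(LocalDateTime.now())(xa, publisher)
}
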
Daenyth