
In the https://gobblin.readthedocs.io/en/latest/case-studies/Kafka-HDFS-Ingestion/#grouping-workunits section of the Gobblin documentation, we can read the following description of single-level packing:

The single-level packer uses a worst-fit-decreasing approach for assigning workunits to mappers: each workunit goes to the mapper that currently has the lightest load. This approach balances the mappers well. However, multiple partitions of the same topic are usually assigned to different mappers. This may cause two issues: (1) many small output files: if multiple partitions of a topic are assigned to different mappers, they cannot share output files. (2) task overhead: when multiple partitions of a topic are assigned to different mappers, a task is created for each partition, which may lead to a large number of tasks and large overhead.
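To make the worst-fit-decreasing behavior concrete, here is a minimal, illustrative Python sketch of that assignment strategy (this is my own toy model, not Gobblin's actual packer code): workunits are sorted largest-first, and each one goes to whichever mapper currently has the lightest load.

```python
import heapq

def worst_fit_decreasing(workunits, num_mappers):
    """Toy worst-fit-decreasing packing: each workunit goes to the
    currently lightest-loaded mapper.

    workunits: list of (name, size) pairs, where size stands in for an
    estimated load (e.g. records to pull from a topic partition).
    Returns a list of workunit-name lists, one per mapper.
    """
    # Sort by size, largest first ("decreasing").
    ordered = sorted(workunits, key=lambda wu: wu[1], reverse=True)
    # Min-heap of (current_load, mapper_index); popping yields the
    # lightest-loaded mapper ("worst fit").
    heap = [(0, i) for i in range(num_mappers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_mappers)]
    for name, size in ordered:
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment

# Three similarly sized partitions of topicA land on three different
# mappers, which is exactly why partitions of one topic get split up.
parts = [("topicA-0", 10), ("topicA-1", 9), ("topicA-2", 8), ("topicB-0", 2)]
print(worst_fit_decreasing(parts, 3))
```

Running this shows each `topicA` partition assigned to a different mapper, so they cannot share an output file: the load balancing itself causes the fragmentation the documentation describes.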

The second overhead seems to contradict what we can read elsewhere. One paragraph higher we can read:

For each partition, after the first and last offsets are determined, a workunit is created.

and here https://gobblin.readthedocs.io/en/latest/Gobblin-Architecture/#gobblin-job-flow in point 3:

From the set of WorkUnits given by the Source, the job creates a set of tasks. A task is a runtime counterpart of a WorkUnit, which represents a logic unit of work. Normally, a task is created per WorkUnit

So, as far as I understand, there is always a task associated with each Kafka partition, unless WorkUnits are grouped together (in which case we have one task for many WorkUnits, and thus many partitions).
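The grouping I am referring to can be sketched like this (a hypothetical helper for illustration, not Gobblin's actual MultiWorkUnit API): if workunits for the same topic are bundled into one group, the runtime creates one task per group instead of one task per partition.

```python
from collections import defaultdict

def group_by_topic(workunits):
    """Illustrative grouping: bundle (topic, partition) workunits by
    topic so each bundle could run as a single task sharing one
    output file per topic. Hypothetical, not Gobblin's real API.
    """
    groups = defaultdict(list)
    for topic, partition in workunits:
        groups[topic].append(partition)
    return dict(groups)

wus = [("topicA", 0), ("topicA", 1), ("topicB", 0)]
print(group_by_topic(wus))  # {'topicA': [0, 1], 'topicB': [0]}
# Ungrouped: 3 tasks (one per workunit).
# Grouped:   2 tasks (one per topic bundle).
```

Under this reading, the task count depends only on how many (multi-)workunits reach the runtime, which is why the "task overhead" claim for single-level packing confuses me.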

Am I misunderstanding something here, or does the second overhead in single-level packing make no sense?

