
I just started with Camus.

I am planning to run Camus, every one hour. We get around ~80000000 messages every hour and average message size is 4KB (we have a single topic in Kafka).

I first tried with 10 mappers, it took ~2hours to copy one hour's data and it created 10 files with ~7GB size.

Then I tried 300 mappers, it brought down the time to ~1 hour. But it created 11 files. Later, I tried with 150 mappers and it took ~30 minutes.

So, how do I choose the number of mappers in this? Also, I want to create more files in hadoop as one size is growing to 7GB. What configuration do I have to check?
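For scale, here is a quick back-of-envelope calculation from the numbers above (ignoring compression and any per-message overhead), showing how much data each mapper handles per run at the mapper counts tried:

```python
# Rough sizing sketch based on the figures in the question:
# ~80,000,000 messages/hour at an average of ~4 KB each.
messages_per_hour = 80_000_000
avg_message_kb = 4

# Total uncompressed volume per hourly run, in GB.
total_gb_per_hour = messages_per_hour * avg_message_kb / (1024 * 1024)

for mappers in (10, 150, 300):
    gb_per_mapper = total_gb_per_hour / mappers
    print(f"{mappers:>3} mappers -> ~{gb_per_mapper:.1f} GB per mapper")
```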

OneCricketeer
Prachi g

2 Answers


Ideally, it should be equal to or less than the number of Kafka partitions in your topic.

That means for better throughput, your topic should have more partitions, and the same number of Camus mappers.
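As a sketch, the mapper cap is set in the Camus job's properties file (property names follow the example config shipped with Camus; the topic name here is hypothetical, and actual parallelism is still bounded by the topic's partition count):

```properties
# camus.properties (sketch, adjust for your setup)
kafka.whitelist.topics=my_topic

# Upper bound on map tasks per Camus run; Camus cannot use more
# mappers than there are partitions in the whitelisted topics.
mapred.map.tasks=150
```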


I found the best answer in this article:

The number of maps is usually driven by the number of DFS blocks in the input files. It causes people to adjust their DFS block size to adjust the number of maps.

The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks.

It is best if the maps take at least a minute to execute.
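The split-based rule of thumb quoted above can be sketched as follows: by default, the number of map tasks is roughly the input size divided by the DFS block size (a simplification that ignores file boundaries and custom input formats; the 305 GB figure and 128 MB block size are illustrative):

```python
import math

def estimated_map_tasks(input_bytes: int, block_size_bytes: int) -> int:
    """Rough default: one map task per DFS block of input."""
    return max(1, math.ceil(input_bytes / block_size_bytes))

# e.g. ~305 GB of hourly input with a 128 MB DFS block size
input_bytes = 305 * 1024**3
block_size = 128 * 1024**2
print(estimated_map_tasks(input_bytes, block_size))
```

This is why adjusting the DFS block size changes the number of maps: a larger block size means fewer, bigger splits.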

It all depends on the CPU power you have, the type of application (IO-bound, i.e. heavy read/write, or CPU-bound, i.e. heavy processing), and the number of nodes in your Hadoop cluster.

Apart from setting the number of mappers and reducers at the global level, override those values at the job level depending on the data-processing needs of each job.

And one more thing: if you think a Combiner will reduce the IO transfer between Mapper and Reducer, use it effectively in combination with a Partitioner.
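To illustrate why a combiner cuts shuffle IO, here is a plain-Python analogy (not Hadoop API code): each mapper pre-aggregates its own (key, count) pairs before anything crosses the network, so fewer records reach the reducers. The sample data is made up:

```python
from collections import Counter

# Two hypothetical mapper outputs: (word, 1) pairs.
mapper_outputs = [
    [("kafka", 1), ("camus", 1), ("kafka", 1), ("kafka", 1)],
    [("camus", 1), ("camus", 1), ("kafka", 1)],
]

# Without a combiner, every pair is shuffled to the reducers.
pairs_without_combiner = sum(len(out) for out in mapper_outputs)

# With a combiner, each mapper locally sums its own pairs first,
# so at most one record per key leaves each mapper.
combined = [Counter(word for word, _ in out) for out in mapper_outputs]
pairs_with_combiner = sum(len(c) for c in combined)

print(pairs_without_combiner, pairs_with_combiner)
```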

Ravindra babu