
I have just started with Camus. I am planning to run a Camus job every hour. We get ~80,000,000 messages (with ~4 KB average size) every hour.

How do I set the following properties:

# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3

I am not able to understand these configurations clearly. Should I set the days property to 1 and the hours property to 2? How does Camus pull the data? I also often see the following error:

ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset

Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset.

How do I set the configurations correctly to run every hour and avoid that error?


1 Answer


"Offset range from kafka metadata is outside the previously persisted offset ."

This error indicates that your job is not fetching data as fast as Kafka is pruning (deleting) it.

Kafka's pruning is controlled by the broker setting log.retention.hours.

1st option: Increase the retention time by changing "log.retention.hours" on the Kafka brokers.
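For example, a broker-side setting might look like the sketch below; the value is only illustrative and should be sized from your own message volume and job schedule:

# server.properties on each Kafka broker (illustrative value, not a recommendation)
# keep several days of log so an hourly Camus run that falls behind can still catch up
log.retention.hours=72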

2nd option: Run the Camus job more frequently, so each run has less backlog to catch up on.

3rd option: Set kafka.move.to.earliest.offset=true in your Camus job. This property forces Camus to start consuming from the earliest offset currently present in Kafka. It can, however, lead to data loss, because the data that was pruned before it could be fetched is simply skipped.
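Putting this together, a minimal camus.properties sketch for an hourly job might look like the following; the specific values (2 hours, 3 days) are assumptions chosen to illustrate the shape of the configuration, not recommendations:

# pull at most this many hours of history from each partition per run,
# based on event timestamp; a small window leaves headroom for a missed run
kafka.max.pull.hrs=2
# events with a timestamp older than this many days are discarded
kafka.max.historical.days=3
# if the persisted offsets have already been pruned from Kafka, restart from the
# earliest available offset instead of failing
kafka.move.to.earliest.offset=true

With kafka.move.to.earliest.offset=true the job recovers automatically after falling behind, but anything Kafka has already pruned is gone, so the first two options are the ones that actually prevent data loss.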