We've been running Camus successfully for about a year, pulling Avro payloads from Kafka (version 0.8.2) and storing them as .avro files in HDFS, using just a few Kafka topics. Recently, a new team within our company registered about 60 new topics in our pre-production environment and started sending data to them. The team made some mistakes when routing their data to Kafka topics, which resulted in errors when Camus deserialized the payloads to Avro for those topics. The Camus job failed because it exceeded the 'skipped other' error threshold. The behavior of Camus after that failure was surprising, so I wanted to check with other developers to see whether what we observed is expected or whether there's an issue with our implementation.
We noticed the following behavior when the Camus job failed due to exceeding the 'skipped other' threshold:
1. All of the mapper tasks succeeded, so the TaskAttempt was allowed to commit - this means all of the data written by Camus was copied to the final HDFS location.
2. The CamusJob threw an exception when it computed the % error rate (this happens after the mapper commit), which caused the job to fail.
3. Because the job failed (I think), the Kafka offsets weren't advanced.
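For context, here's a rough sketch of what I assume the post-run check looks like - I haven't traced the actual CamusJob source, so the method name and structure below are my own placeholders; only the counter group and counter names are taken from the log at the end of this post:

import org.apache.hadoop.mapreduce.Job;

public class SkipThresholdCheckSketch {
  // Sketch only: read the mapper counters after the MapReduce job completes and
  // fail the run if the skip percentage exceeds the configured maximum.
  static void checkSkippedOther(Job job, double maxAllowedPercent) throws Exception {
    String group = "com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG";
    long skipped = job.getCounters().findCounter(group, "SKIPPED_OTHER").getValue();
    long decoded = job.getCounters().findCounter(group, "DECODE_SUCCESSFUL").getValue();
    double skippedPercent = 100.0 * skipped / (skipped + decoded);

    // By this point the mappers have already committed their output, so throwing here
    // fails the job without rolling back the HDFS files or advancing the saved offsets.
    if (skippedPercent > maxAllowedPercent) {
      throw new RuntimeException(String.format(
          "job failed: %.1f%% messages skipped due to other, maximum allowed is %.1f%%",
          skippedPercent, maxAllowedPercent));
    }
  }
}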
The problem with this behavior is that our Camus job runs every 5 minutes. So every 5 minutes, data was committed to HDFS, the job failed, and the Kafka offsets weren't updated - which meant we kept writing duplicate data until we noticed our disks were filling up.
I wrote an integration test that reproduces the behavior - it submits 10 good records and 10 records with an unexpected schema to the same topic, runs the Camus job with only that topic whitelisted, and verifies that the 10 good records are written to HDFS while the Kafka offsets aren't advanced. Below is a snippet of the logs from that test, as well as the properties we used while running the job.
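For reference, the producing side of the test looks roughly like this. It's a simplified sketch: the two schemas, the broker address, and the bare Avro encoding are placeholders, and the schema-id framing our registry/decoder actually expects is omitted:

import java.io.ByteArrayOutputStream;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MixedSchemaProducer {
  // Two throwaway schemas: GOOD stands in for the schema Camus expects for the topic,
  // BAD for the unexpected schema that the misrouted records used.
  static final Schema GOOD = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Good\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
  static final Schema BAD = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Bad\",\"fields\":[{\"name\":\"label\",\"type\":\"string\"}]}");

  static byte[] serialize(Schema schema, GenericRecord record) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

    String topic = "advertising.edmunds.admax";
    KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
    try {
      for (int i = 0; i < 10; i++) {
        GenericRecord good = new GenericData.Record(GOOD);
        good.put("id", (long) i);
        GenericRecord bad = new GenericData.Record(BAD);
        bad.put("label", "unexpected-" + i);
        // NOTE: the real test wraps these payloads in the schema-id framing that
        // KafkaAvroMessageDecoder expects; that part is omitted here.
        producer.send(new ProducerRecord<>(topic, serialize(GOOD, good)));
        producer.send(new ProducerRecord<>(topic, serialize(BAD, bad)));
      }
    } finally {
      producer.close();
    }
  }
}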
Any help is appreciated - I'm not sure whether this is expected behavior for Camus or a problem with our implementation, and I'd like to know the best way to prevent the duplicate data.
Thanks ~ Matt
CamusJob properties for the test:
etl.destination.path=/user/camus/kafka/data
etl.execution.base.path=/user/camus/kafka/workspace
etl.execution.history.path=/user/camus/kafka/history
dfs.default.classpath.dir=/user/camus/kafka/libs
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.AvroRecordWriterProvider
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss Z
mapreduce.output.fileoutputformat.compress=false
mapred.map.tasks=15
kafka.max.pull.hrs=1
kafka.max.historical.days=3
kafka.whitelist.topics=advertising.edmunds.admax
log4j.configuration=true
kafka.client.name=camus
kafka.brokers=<kafka brokers>
max.decoder.exceptions.to.print=5
post.tracking.counts.to.kafka=true
monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka
kafka.message.coder.schema.registry.class=com.linkedin.camus.schemaregistry.AvroRestSchemaRegistry
etl.schema.registry.url=<schema repo url>
etl.run.tracking.post=false
kafka.monitor.time.granularity=10
etl.daily=daily
etl.ignore.schema.errors=false
etl.output.codec=deflate
etl.deflate.level=6
etl.default.timezone=America/Los_Angeles
mapred.output.compress=false
mapred.map.max.attempts=2
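We didn't override any skip-threshold properties, so the 0.1% in the failure message below appears to be a default. If I'm reading CamusJob correctly, I believe the relevant property is the one below (please correct me if the name is wrong) - we left it unset:

max.percent.skipped.other=0.1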
Log snippet from the test, showing the commit behavior after the mappers succeed and the subsequent job failure due to exceeding the 'skipped other' threshold:
LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task:attempt_local866350146_0001_m_000000_0 is done. And is in the process of committing
[LocalJobRunner] - advertising.edmunds.admax:2:6; advertising.edmunds.admax:3:7 begin read at 2016-07-08T05:50:26.215-07:00; advertising.edmunds.admax:1:5; advertising.edmunds.admax:2:2; advertising.edmunds.admax:3:3 begin read at 2016-07-08T05:50:30.517-07:00; advertising.edmunds.admax:0:4 > map
[Task] - Task attempt_local866350146_0001_m_000000_0 is allowed to commit now
[EtlMultiOutputFormat] - work path: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0
[EtlMultiOutputFormat] - Destination base path: /user/camus/kafka/data
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.3.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.3.2.2.1467979200000.avro
[EtlMultiOutputFormat] - work file: data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro
[EtlMultiOutputFormat] - Moved file from: file:/user/camus/kafka/workspace/2016-07-08-12-50-20/_temporary/0/_temporary/attempt_local866350146_0001_m_000000_0/data.advertising-edmunds-admax.3.7.1467979200000-m-00000.avro to: /user/camus/kafka/data/advertising-edmunds-admax/advertising-edmunds-admax.3.7.8.8.1467979200000.avro
[Task] - Task 'attempt_local866350146_0001_m_000000_0' done.
[LocalJobRunner] - Finishing task: attempt_local866350146_0001_m_000000_0
[LocalJobRunner] - map task executor complete.
[Job] - map 100% reduce 0%
[Job] - Job job_local866350146_0001 completed successfully
[Job] - Counters: 23
File System Counters
FILE: Number of bytes read=117251
FILE: Number of bytes written=350942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=10
Map output records=15
Input split bytes=793
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13
Total committed heap usage (bytes)=251658240
com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
DECODE_SUCCESSFUL=10
SKIPPED_OTHER=10
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=5907
total
data-read=840
decode-time(ms)=123
event-count=20
mapper-time(ms)=58
request-time(ms)=12114
skip-old=0
[CamusJob] - Group: File System Counters
[CamusJob] - FILE: Number of bytes read: 117251
[CamusJob] - FILE: Number of bytes written: 350942
[CamusJob] - FILE: Number of read operations: 0
[CamusJob] - FILE: Number of large read operations: 0
[CamusJob] - FILE: Number of write operations: 0
[CamusJob] - Group: Map-Reduce Framework
[CamusJob] - Map input records: 10
[CamusJob] - Map output records: 15
[CamusJob] - Input split bytes: 793
[CamusJob] - Spilled Records: 0
[CamusJob] - Failed Shuffles: 0
[CamusJob] - Merged Map outputs: 0
[CamusJob] - GC time elapsed (ms): 13
[CamusJob] - Total committed heap usage (bytes): 251658240
[CamusJob] - Group: com.linkedin.camus.etl.kafka.mapred.EtlRecordReader$KAFKA_MSG
[CamusJob] - DECODE_SUCCESSFUL: 10
[CamusJob] - SKIPPED_OTHER: 10
[CamusJob] - job failed: 50.0% messages skipped due to other, maximum allowed is 0.1%