
I want to load Apache server logs into HDFS using Kafka and Camus.
Creating the topic:

./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew  
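To double-check that the topic was created (the --describe flag should be available in this version's kafka-topics.sh):

./kafka-topics.sh --describe --zookeeper 10.25.3.207:2181 --topic lognew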

Tailing the Apache access log and piping it into a console producer:

tail -f /var/log/httpd/access_log | ./kafka-console-producer.sh --broker-list 10.25.3.207:6667 --topic lognew  

In another terminal (in the Kafka bin directory), start a consumer:

./kafka-console-consumer.sh --zookeeper 10.25.3.207:2181 --topic lognew --from-beginning  

The camus.properties file is configured as follows:

# Needed Camus properties, more cleanup to come  
# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=/user/root/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=/user/root/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=/user/root/camus/exec/history

# Kafka-0.8 handles all zookeeper calls
#zookeeper.hosts=
#zookeeper.broker.topics=/brokers/topics
#zookeeper.broker.nodes=/brokers/ids

# Concrete implementation of the Encoder class to use (used by Kafka Audit, and thus optional for now)
camus.message.encoder.class=com.linkedin.camus.etl.kafka.coders.DummyKafkaMessageEncoder

# Concrete implementation of the Decoder class to use
#camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder

# Used by avro-based Decoders to use as their Schema Registry
#kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry

# Used by the committer to arrange .avro files into a partitioned scheme. This will be the default partitioner for all
# topics that do not have a partitioner specified
#etl.partitioner.class=com.linkedin.camus.etl.kafka.coders.DefaultPartitioner

# Partitioners can also be set on a per-topic basis
#etl.partitioner.class.<topic-name>=com.your.custom.CustomPartitioner

# all files in this dir will be added to the distributed cache and placed on the classpath for hadoop tasks
# hdfs.default.classpath.dir=

# max hadoop tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=30
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
# Max minutes for each mapper to pull messages (-1 means no limit)
kafka.max.pull.minutes.per.task=-1

# if the whitelist has values, only whitelisted topics are pulled; nothing on the blacklist is pulled
#kafka.blacklist.topics=
kafka.whitelist.topics=lognew
log4j.configuration=true

# Name of the client as seen by kafka
kafka.client.name=camus
# Fetch Request Parameters
#kafka.fetch.buffer.size=
#kafka.fetch.request.correlationid=
#kafka.fetch.request.max.wait=
#kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=10.25.3.207:6667
#kafka.timeout.value=


#Stops the mapper from getting inundated with Decoder exceptions for the same topic
#Default value is set to 10
max.decoder.exceptions.to.print=5

#Controls the submitting of counts to Kafka
#Default value set to true
post.tracking.counts.to.kafka=true
monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka

# everything below this point can be ignored for the time being, will provide more documentation down the road
##########################
etl.run.tracking.post=false
#kafka.monitor.tier=
#etl.counts.path=
kafka.monitor.time.granularity=10

etl.hourly=hourly
etl.daily=daily
etl.ignore.schema.errors=false

# configure output compression for deflate or snappy. Defaults to deflate
etl.output.codec=deflate
etl.deflate.level=6
#etl.output.codec=snappy

etl.default.timezone=America/Los_Angeles
etl.output.file.time.partition.mins=60
etl.keep.count.files=false
etl.execution.history.max.of.quota=.8

mapred.output.compress=true
mapred.map.max.attempts=1

kafka.client.buffer.size=20971520
kafka.client.so.timeout=60000

#zookeeper.session.timeout=
#zookeeper.connection.timeout=

I get errors when I execute the command below:

hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties

Below is the error:

[CamusJob] - Fetching metadata from broker 10.25.3.207:6667 with client id camus for 0 topic(s) []
[CamusJob] - failed to create decoder
com.linkedin.camus.coders.MessageDecoderException: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
    at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:28)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.createMessageDecoder(EtlInputFormat.java:390)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.getSplits(EtlInputFormat.java:264)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at com.linkedin.camus.etl.kafka.CamusJob.run(CamusJob.java:280)
    at com.linkedin.camus.etl.kafka.CamusJob.run(CamusJob.java:608)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.linkedin.camus.etl.kafka.CamusJob.main(CamusJob.java:572)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:40)
    at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:24)
    ... 22 more
Caused by: java.lang.NullPointerException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:195)
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:31)
    ... 23 more
[CamusJob] - Discarding topic (Decoder generation failed) : avrotopic
[CamusJob] - failed to create decoder

Please suggest what can be done to resolve this problem. Thanks in advance.

Deepthy


1 Answer


I've never used Camus, but I believe this is a Kafka-related error, and it has to do with how you're encoding/decoding the message. I believe the important lines in your stack trace are:

Caused by: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
  at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:40)
  at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:24)

How are you telling Kafka to use your Avro encoding? You've commented out the following line in your config:

#kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry

So are you setting that somewhere else in code? If you're not, I would suggest uncommenting that config value and setting it to whatever Avro class you're trying to encode/decode with.
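For example, the relevant lines might end up looking something like this. KafkaAvroMessageDecoder appears in your stack trace, so it's presumably the decoder in play, but com.your.package.YourSchemaRegistry is just a placeholder for whatever registry implementation you end up using:

camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder
kafka.message.coder.schema.registry.class=com.your.package.YourSchemaRegistry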

It might take you some debugging to use the right classpath and such, but I believe this is an easily solvable problem.
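For instance, if your registry class lives in its own jar, something like the following might get it onto the classpath (paths are hypothetical; -libjars is standard Hadoop ToolRunner behavior, and your stack trace shows CamusJob runs through ToolRunner):

export HADOOP_CLASSPATH=/path/to/my-registry.jar
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -libjars /path/to/my-registry.jar -P camus.properties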

EDIT: In response to your comments, I have a couple of comments of my own.

  1. I have never used Camus. So debugging the errors you get from Camus is not something that I'll be able to do very well or at all. So you'll have to spend some time (maybe several hours) researching and trying different things to get it to work.
  2. I doubt DummySchemaRegistry is the correct configuration value that you need. Anything starting with Dummy is probably not a valid configuration option; you'll likely need a real registry implementation (see the sketch after this list).
  3. A simple Google search about Camus and schema registries revealed some interesting links: SchemaRegistry Classes and KafkaAvroMessageEncoder. Those are more likely to be the correct config values that you need. Just my guess, because again, I've never used Camus.
  4. This could also be of some use to you. I don't know if you've seen it, but if you haven't, googling the specific error you get is something you should do before coming to Stack Overflow.
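For illustration only, a minimal registry might look roughly like the sketch below. It assumes Camus's MemorySchemaRegistry base class and its register(topic, schema) method behave the way they appear to in the LinkedIn Camus sources; the class name, package, and schema are all invented for the example:

package com.example.camus;  // hypothetical package

import org.apache.avro.Schema;
import com.linkedin.camus.schemaregistry.MemorySchemaRegistry;

// Pre-registers a fixed Avro schema for the "lognew" topic so that
// the Avro decoder has something to look up at init time.
public class LogSchemaRegistry extends MemorySchemaRegistry<Schema> {
    public LogSchemaRegistry() {
        super();
        // A minimal record schema for a raw log line; replace with your own.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogLine\","
            + "\"fields\":[{\"name\":\"line\",\"type\":\"string\"}]}");
        super.register("lognew", schema);
    }
}

You'd then point kafka.message.coder.schema.registry.class at that class and make sure the jar containing it is on the job's classpath.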
Morgan Kenyon
  • I uncommented the config value `kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry` but getting another error: – Deepthy Nov 17 '15 at 04:45
  • `com.linkedin.camus.coders.MessageDecoderException: java.lang.InstantiationException: com.linkedin.camus.example.schemaregistry.DummySchemaRegistry at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:28) at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.createMessageDecoder(EtlInputFormat.java:390) at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.getSplits(EtlInputFormat.java:264) at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)` – Deepthy Nov 17 '15 at 04:55