Log files of my application keep accumulating on a server. I want to dump them into HDFS through Kafka. I want the Kafka producer to read the log files, send them to a Kafka broker, and then move those files to another folder. Can a Kafka producer read log files? Also, is it possible to have the copying logic in the Kafka producer?
3 Answers
- Kafka maintains feeds of messages in categories called topics.
- We'll call processes that publish messages to a Kafka topic producers.
- We'll call processes that subscribe to topics and process the feed of published messages consumers.
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
So this is not suitable for your application, where you want to ingest log files. Instead, you can try Flume.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
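To make that concrete, here is a minimal sketch of what a Flume agent for this use case might look like, assuming the standard spooling-directory source (which marks each file as completed after ingest, which is close to your "move to another folder" requirement) and the HDFS sink. All agent, host, and path names below are placeholders:

```properties
# Sketch of a Flume agent: spooldir source -> memory channel -> HDFS sink.
agent.sources = logsrc
agent.channels = mem
agent.sinks = hdfssink

# Watch a directory for new, fully-written log files.
agent.sources.logsrc.type = spooldir
agent.sources.logsrc.spoolDir = /var/log/myapp
# Completed files are renamed with a .COMPLETED suffix instead of deleted.
agent.sources.logsrc.deletePolicy = never
agent.sources.logsrc.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Land the events in date-partitioned HDFS directories.
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfssink.channel = mem
```

You would run this with the `flume-ng agent` command, pointing it at this properties file.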

As you know, Apache Kafka is a publish-subscribe messaging system. To send messages from your application, you can use the Kafka clients or the Kafka REST API.
In short, your application can read your log files and send those logs to Kafka topics.
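As a rough sketch of that idea, assuming the `kafka-python` client, a broker at `localhost:9092`, and a placeholder topic name `app-logs` (none of which come from the question itself), a producer could read each file, publish its lines, and then archive the file:

```python
# Sketch: read each log file, publish its lines to Kafka, then archive it.
import os
import shutil

def ship_logs(producer, log_dir, archive_dir, topic="app-logs"):
    """Send every line of every file in log_dir to Kafka, then move the file."""
    os.makedirs(archive_dir, exist_ok=True)
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        with open(path, "rb") as f:
            for line in f:
                producer.send(topic, line.rstrip(b"\n"))
        producer.flush()  # wait for broker acknowledgement before archiving
        shutil.move(path, os.path.join(archive_dir, name))

# Usage (requires a running broker):
#   from kafka import KafkaProducer   # pip install kafka-python
#   ship_logs(KafkaProducer(bootstrap_servers="localhost:9092"),
#             "/var/log/myapp", "/var/log/myapp/archived")
```

Flushing before the move matters: it answers the second part of your question by making sure a file is only "copied away" once the broker has the data.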
To process these logs, you can use Apache Storm. You can find many integrated solutions for this purpose, and with Storm you can add whatever logic you need to your stream processing.
You can read many useful, detailed articles about Storm-Kafka integration.
Also, to put your processed logs into HDFS, you can easily integrate Storm with Hadoop. You can check this repo for it.

Kafka was developed to support high-volume event streams such as real-time log aggregation. From the Kafka documentation:
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.
Also, I got this little piece of information from a nice article, which is almost identical to your use case:
Today, Kafka has been used in production at LinkedIn for a number of projects. There are both offline and online usages. In the offline case, we use Kafka to feed all activity events to our data warehouse and Hadoop, from which we then run various batch analyses.
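On the HDFS side of that pipeline, a consumer typically groups messages into batches so each batch becomes one HDFS append rather than one tiny write per message. The batching helper below is a plain sketch; the wiring in the usage comment assumes the `kafka-python` and `hdfs` (hdfscli) packages, and the broker, namenode, topic, and path names are all placeholders:

```python
# Sketch: batch Kafka message payloads for efficient HDFS appends.

def batch_messages(values, batch_size=1000):
    """Yield lists of at most batch_size payloads from an iterable of messages."""
    batch = []
    for value in values:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Usage (requires a broker and an HDFS namenode):
#   from kafka import KafkaConsumer   # pip install kafka-python
#   from hdfs import InsecureClient   # pip install hdfs
#   consumer = KafkaConsumer("app-logs", bootstrap_servers="localhost:9092")
#   client = InsecureClient("http://namenode:50070")
#   for batch in batch_messages(m.value for m in consumer):
#       client.write("/logs/app.log", data=b"\n".join(batch) + b"\n", append=True)
```

Tools like Camus (LinkedIn's Kafka-to-HDFS job) do this batching for you, but the sketch shows the basic shape of the consumer side.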
