I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
- We have applications that produce log files ranging from a few MB to hundreds of MB per file.
- We do not want to stream all the log events as they occur.
- Instead, we want to push each log file in its entirety once it has been completely written ("completely written" could mean, for example, that the file got moved into another folder... detecting this is not a problem for us).
- This should be handled by some kind of lightweight agent on the edge nodes, writing either to HDFS directly or - if necessary - to an intermediate "sink" that pushes the data to HDFS afterwards.
- Centralized pipeline management (i.e. configuring all edge nodes in one central place) would be great
I came up with the following evaluation:
- Elastic's Logstash and Filebeat
- Centralized pipeline management for the edge nodes is available, i.e. one central configuration for all edge nodes (requires a license)
- Configuration is easy, and a WebHDFS output plugin exists for Logstash (using Filebeat alone would require an intermediate step, i.e. Filebeat shipping to a Logstash instance that then outputs to WebHDFS); a minimal config sketch follows after this block
- Both tools have proven stable in production environments
- Both tools are built for tailing logs and streaming individual events as they occur rather than ingesting a complete file at once
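To illustrate the Logstash route: a pipeline could look roughly like the sketch below. This is only an illustration - the paths, host names and the "completed" folder convention are placeholders of mine, not validated settings.

```
# Rough sketch of a Logstash pipeline: read completed log files, ship via WebHDFS.
# All paths and host names below are placeholders.
input {
  file {
    path => "/var/log/app/completed/*.log"
    mode => "read"                  # read whole files instead of tailing them
    file_completed_action => "log"  # "read" mode requires a completed-files log
    file_completed_log_path => "/var/log/logstash/completed.log"
  }
}
output {
  webhdfs {
    host => "namenode.example.com"  # WebHDFS (NameNode) endpoint
    port => 50070
    path => "/logs/app/%{+YYYY-MM-dd}/%{host}.log"
    user => "hdfs"                  # HDFS user to write as
  }
}
```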
- Apache NiFi w/ MiNiFi
- Collecting logs and sending entire files to another location from a large number of edge nodes that all run the same "jobs" looks like a predestined use case for NiFi and MiNiFi
- MiNiFi running on the edge node is lightweight (Logstash, on the other hand, is not so lightweight)
- Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS (a minimal agent config sketch follows after this block)
- Centralized pipeline management within the NiFi UI
- Writing to an HDFS sink is available out-of-the-box
- The community looks active, and development is led by Hortonworks (?)
- We have had good experiences with NiFi in the past
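For the MiNiFi option, the agent flow I have in mind would look roughly like this in config.yml (a sketch from memory - all IDs, URLs and directories are placeholders):

```
# Rough sketch of a MiNiFi config.yml: pick up completed files with GetFile
# and send them to a central NiFi input port via Site-to-Site.
MiNiFi Config Version: 3
Flow Controller:
  name: edge-log-shipper
Processors:
- id: 00000000-0000-0000-0000-000000000001
  name: GetFile
  class: org.apache.nifi.processors.standard.GetFile
  scheduling strategy: TIMER_DRIVEN
  scheduling period: 10 sec
  Properties:
    Input Directory: /var/log/app/completed
    Keep Source File: false
Connections:
- id: 00000000-0000-0000-0000-000000000002
  name: GetFile-to-S2S
  source id: 00000000-0000-0000-0000-000000000001
  source relationship names:
  - success
  destination id: 00000000-0000-0000-0000-000000000003
Remote Process Groups:
- id: 00000000-0000-0000-0000-000000000004
  name: central-nifi
  url: http://nifi.example.com:8080/nifi
  Input Ports:
  - id: 00000000-0000-0000-0000-000000000003  # ID of the input port on the NiFi side
    name: from-minifi
```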
- Apache Flume
- Writing to an HDFS sink is available out-of-the-box
- Flume looks like more of an event-based solution than one for streaming entire log files (see the sketch after this block)
- No centralized pipeline management?
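If we went the Flume route, an agent would presumably look roughly like this, using the spooling directory source (it picks up completed files from a folder, but by default still splits them into line-based events, hence the caveat above); hosts and paths are placeholders:

```
# Rough sketch of a Flume agent: spooldir source -> memory channel -> HDFS sink.
agent.sources = spool
agent.channels = mem
agent.sinks = hdfs

# Watches a directory for completed files; files must not change after arrival.
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/app/completed
agent.sources.spool.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs.type = hdfs
agent.sinks.hdfs.channel = mem
agent.sinks.hdfs.hdfs.path = hdfs://namenode.example.com:8020/logs/app/%Y-%m-%d
agent.sinks.hdfs.hdfs.fileType = DataStream
agent.sinks.hdfs.hdfs.useLocalTimeStamp = true
```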
- Apache Gobblin
- Writing to an HDFS sink is available out-of-the-box
- No centralized pipeline management?
- No lightweight edge node "agents"?
- Fluentd
- Maybe another tool to look at? Looking for your comments on this one... (as far as I can tell there is a WebHDFS output plugin, fluent-plugin-webhdfs; a rough sketch follows below)
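From a quick look, a Fluentd config using that plugin might look roughly like this (untested sketch; hosts, paths and the tag are placeholders, and in_tail is again line/event-based):

```
# Rough sketch of a Fluentd config shipping log lines to HDFS via
# fluent-plugin-webhdfs. Hosts, paths and the tag are placeholders.
<source>
  @type tail
  path /var/log/app/completed/*.log
  pos_file /var/log/td-agent/app-logs.pos
  tag app.logs
  <parse>
    @type none                # keep raw lines, no parsing
  </parse>
</source>

<match app.logs>
  @type webhdfs
  host namenode.example.com   # WebHDFS (NameNode) endpoint
  port 50070
  path /logs/app/%Y%m%d/app.log
  <buffer>
    flush_interval 10s
  </buffer>
</match>
```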
I'd love to get some comments on which of these options to choose. The NiFi/MiNiFi option looks the most promising to me - and it is free to use as well.
Have I forgotten any widely used tool that would be able to solve this use case?