Looking for an approach in Apache Samza to read a file from the local file system or HDFS, then apply filters, aggregation, where conditions, order by, and group by to a batch of data. Please provide some help.
2 Answers
You should create a system for each source of data you want to use. For example, to read from a file, create a system with the FileReaderSystemFactory; for HDFS, create a system with the HdfsSystemFactory. Then you can use the regular process callback or windowing to process your data.
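A minimal sketch of the wiring, assuming the factory classes live at org.apache.samza.system.filereader.FileReaderSystemFactory and org.apache.samza.system.hdfs.HdfsSystemFactory (their usual packages), and that the file path and task class are placeholders:

```
# job config sketch -- system names, paths, and the task class are placeholders
systems.localfile.samza.factory=org.apache.samza.system.filereader.FileReaderSystemFactory
systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

# the file-reader system treats the stream name as the file path
task.inputs=localfile./data/input.txt
task.class=example.FilterLinesTask
```

The task itself would then do the filtering in the process callback; aggregation works the same way, typically flushed periodically from a WindowableTask's window() method:

```java
package example;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FilterLinesTask implements StreamTask {
  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String line = (String) envelope.getMessage();
    // stand-in for a filter / where condition
    if (line.contains("ERROR")) {
      // forward, count, or aggregate the matching line here
    }
  }
}
```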

Jon Bringhurst
You can feed your Samza job using a standard Kafka producer. To make it easy, you can use Logstash: create a Logstash script (see the sketch after this list) in which you specify:
- an input: a local file or HDFS
- filters (optional): basic filtering, aggregation, etc.
- a Kafka output with the specific topic you want to feed
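A minimal sketch of such a Logstash script, where the file path, grok pattern, and topic name are placeholders:

```
input {
  file {
    path => "/data/input.log"
    start_position => "beginning"
  }
}
filter {
  # optional parsing/filtering, e.g. for Apache access logs
  grok { match => { "message" => "%{COMMONAPACHELOG}" } }
}
output {
  kafka {
    topic_id => "samza-input-topic"
    bootstrap_servers => "localhost:9092"
  }
}
```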
I used this approach to feed my Samza job from a local file.
Another approach would be to use Kafka Connect: http://docs.confluent.io/2.0.0/connect/
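For the Kafka Connect route, the FileStreamSource connector that ships with Kafka covers the local-file case; a sketch of its standalone config, with the file and topic names as placeholders:

```
# connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/data/input.txt
topic=samza-input-topic
```

Run it with bin/connect-standalone.sh and point your Samza job's Kafka input system at that topic.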

Stefan Repcek