Looking for an approach in Apache Samza to read a file from the local file system or HDFS, then apply filters, aggregation, where conditions, order by, and group by to a batch of data. Please provide some help.
2 Answers
You should create a system for each source of data you want to use. For example, to read from a file, create a system with the FileReaderSystemFactory; for HDFS, create a system with the HdfsSystemFactory. Then you can use the regular process callback or windowing to process your data.
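A minimal sketch of the wiring, assuming the factory classes live at org.apache.samza.system.filereader.FileReaderSystemFactory and org.apache.samza.system.hdfs.HdfsSystemFactory (their usual packages), and that the file path and task class are placeholders:

```
# job config sketch -- system names, paths, and the task class are placeholders
systems.localfile.samza.factory=org.apache.samza.system.filereader.FileReaderSystemFactory
systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

# the file-reader system treats the stream name as the file path
task.inputs=localfile./data/input.txt
task.class=example.FilterLinesTask
```

The task itself would then do the filtering in the process callback; aggregation works the same way, typically flushed periodically from a WindowableTask's window() method:

```java
package example;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FilterLinesTask implements StreamTask {
  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String line = (String) envelope.getMessage();
    // stand-in for a filter / where condition
    if (line.contains("ERROR")) {
      // forward, count, or aggregate the matching line here
    }
  }
}
```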

Jon Bringhurst
You can feed your Samza job using a standard Kafka producer. To make it easy, you can use Logstash: create a Logstash script (see the sketch after this list) in which you specify:
- an input: a local file or HDFS
- filters (optional): basic filtering, aggregation, etc.
- a Kafka output with the specific topic you want to feed
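A minimal sketch of such a Logstash script, where the file path, grok pattern, and topic name are placeholders:

```
input {
  file {
    path => "/data/input.log"
    start_position => "beginning"
  }
}
filter {
  # optional parsing/filtering, e.g. for Apache access logs
  grok { match => { "message" => "%{COMMONAPACHELOG}" } }
}
output {
  kafka {
    topic_id => "samza-input-topic"
    bootstrap_servers => "localhost:9092"
  }
}
```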
I used this approach to feed my Samza job from a local file.
Another approach would be to use Kafka Connect: http://docs.confluent.io/2.0.0/connect/
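For the Kafka Connect route, the FileStreamSource connector that ships with Kafka covers the local-file case; a sketch of its standalone config, with the file and topic names as placeholders:

```
# connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/data/input.txt
topic=samza-input-topic
```

Run it with bin/connect-standalone.sh and point your Samza job's Kafka input system at that topic.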

Stefan Repcek