I'm working on a Big Data project. We are using Flume to download files from an SFTP server into HDFS. We configured 3 agents that all read from the same source, and as a consequence each file ends up duplicated 3 times in HDFS, whereas we must have only one copy. We also need to keep traceability of the processed files and manage the concurrency between the agents. For example, with 3 agents A1, A2, and A3: if a file xxx.csv is being processed (or has already been processed) by agent A2, the other agents must not process it and should look for unprocessed files instead. In short, each file must be processed by exactly one agent.
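To make the requirement concrete, here is a minimal sketch of the kind of coordination we have in mind (not something Flume provides out of the box, as far as we know): each agent would atomically create a marker file in a shared HDFS tracking directory before downloading a file, and skip the file if the marker already exists. The class name and the tracking path below are hypothetical, and we are not tied to this approach.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of per-file claiming: before an agent pulls a file from SFTP, it
 * tries to create a marker file in a shared HDFS tracking directory. Only the
 * agent that wins the creation race processes the file; the others skip it.
 * The tracking directory also serves as a simple processing trace.
 */
public class FileClaimRegistry {

    // Hypothetical tracking directory; any HDFS path visible to all 3 agents would do.
    private static final Path TRACKING_DIR = new Path("/flume/processed-markers");

    private final FileSystem fs;

    public FileClaimRegistry(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
        fs.mkdirs(TRACKING_DIR);
    }

    /**
     * Returns true only for the single agent that creates the marker first,
     * relying on HDFS file creation (createNewFile) being atomic across clients.
     */
    public boolean tryClaim(String fileName) throws IOException {
        Path marker = new Path(TRACKING_DIR, fileName + ".claimed");
        return fs.createNewFile(marker);
    }

    public static void main(String[] args) throws IOException {
        FileClaimRegistry registry = new FileClaimRegistry(new Configuration());
        String file = "xxx.csv";
        if (registry.tryClaim(file)) {
            System.out.println("This agent processes " + file);
            // ... download from SFTP and write to HDFS here ...
        } else {
            System.out.println(file + " is already claimed by another agent, skipping");
        }
    }
}
```

We are open to other mechanisms as well (for example a shared tracking table or ZooKeeper-based locking), as long as each file is processed exactly once and remains traceable.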
Has anyone worked on a similar issue?