I'm working on a Big Data project. We are using Flume to download files from an SFTP server into HDFS. We configured 3 agents that all read from the same source, and as a consequence each file ends up duplicated 3 times in HDFS, whereas we must have only one copy. We also need to keep traceability of the processed files and manage the concurrency between the agents. For example, with 3 agents A1, A2, and A3: if a file xxx.csv is being processed (or has already been processed) by agent A2, the other agents must not process it and should look for unprocessed files instead. In short, each file must be processed by exactly one agent.
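To make the requirement concrete, here is a minimal sketch of the kind of coordination we have in mind (not something Flume provides out of the box, as far as we know): each agent would atomically create a marker file in a shared HDFS tracking directory before downloading a file, and skip the file if the marker already exists. The class name and the tracking path below are hypothetical, and we are not tied to this approach.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of per-file claiming: before an agent pulls a file from SFTP, it
 * tries to create a marker file in a shared HDFS tracking directory. Only the
 * agent that wins the creation race processes the file; the others skip it.
 * The tracking directory also serves as a simple processing trace.
 */
public class FileClaimRegistry {

    // Hypothetical tracking directory; any HDFS path visible to all 3 agents would do.
    private static final Path TRACKING_DIR = new Path("/flume/processed-markers");

    private final FileSystem fs;

    public FileClaimRegistry(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
        fs.mkdirs(TRACKING_DIR);
    }

    /**
     * Returns true only for the single agent that creates the marker first,
     * relying on HDFS file creation (createNewFile) being atomic across clients.
     */
    public boolean tryClaim(String fileName) throws IOException {
        Path marker = new Path(TRACKING_DIR, fileName + ".claimed");
        return fs.createNewFile(marker);
    }

    public static void main(String[] args) throws IOException {
        FileClaimRegistry registry = new FileClaimRegistry(new Configuration());
        String file = "xxx.csv";
        if (registry.tryClaim(file)) {
            System.out.println("This agent processes " + file);
            // ... download from SFTP and write to HDFS here ...
        } else {
            System.out.println(file + " is already claimed by another agent, skipping");
        }
    }
}
```

We are open to other mechanisms as well (for example a shared tracking table or ZooKeeper-based locking), as long as each file is processed exactly once and remains traceable.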
Has anyone worked on a similar issue?