I have a directory consisting of multiple files, and it is shared across multiple data collectors. I have a job that processes those files and writes them to a destination. Because the number of records is huge, I want to run the job on multiple data collectors, but when I tried that I got duplicate entries in my destination. Is there a way to achieve this without duplicating the records? Thanks
2 Answers
You can use Kafka for this. For example:
- Create one pipeline that reads file names and sends them to a Kafka topic via a Kafka producer.
- Create a pipeline with a Kafka consumer as the origin and set its consumer group property. This pipeline will read the file names and work with the files.
- Now you can run multiple pipelines with Kafka consumers in the same consumer group. Kafka will balance messages within the consumer group by itself, so you will not get duplicates.
- To be sure that no file names are lost along the way, also set the 'acks' = 'all' property on the Kafka producer.
With this scheme you can run as many collectors as your Kafka topic has partitions. Hope it helps.
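To make the scheme concrete, here is a minimal Java sketch of both sides using the standard kafka-clients API. The broker address `localhost:9092`, topic `file-names`, and group `file-processors` are placeholders; in the actual pipelines you would configure SDC's Kafka Producer destination and Kafka Consumer origin rather than write clients by hand:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileNameQueue {

    // Producer side: publish each file name to the shared topic.
    static void publishFileName(String fileName) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("acks", "all"); // wait for all in-sync replicas so no file name is lost
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by file name spreads files across partitions deterministically.
            producer.send(new ProducerRecord<>("file-names", fileName, fileName));
        }
    }

    // Consumer side: every data collector joins the same consumer group, so
    // Kafka assigns each partition to exactly one consumer and no file name
    // is delivered to two collectors at once.
    static void consumeFileNames() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "file-processors"); // same group on every collector
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("file-names"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    processFile(record.value()); // read the file and load it
                }
            }
        }
    }

    static void processFile(String fileName) { /* application-specific */ }
}
```

Because partition assignment is the unit of balancing, adding consumers beyond the partition count leaves the extra ones idle, which is why the collector count is bounded by the number of partitions.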

Copying my answer from Ask StreamSets:
At present there is no way to automatically partition directory contents across multiple data collectors.
You could run similar pipelines on multiple data collectors and manually partition the data in the origin using different character ranges in the File Name Pattern configurations. For example, if you had two data collectors, and your file names were distributed across the alphabet, the first instance might process `[a-m]*` and the second `[n-z]*`.
One way to do this would be by setting File Name Pattern to a runtime parameter, for example `${FileNamePattern}`. You would then set the value for the pattern in the pipeline's parameters tab, or when starting the pipeline via the CLI, API, UI or Control Hub.
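To see why the two ranges never overlap, here is a small Java sketch using `java.nio.file`'s glob matching, which follows the same glob syntax the Directory origin's File Name Pattern uses in glob mode (the file names are made up for illustration):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

public class PatternPartitionDemo {
    public static void main(String[] args) {
        // Two disjoint character ranges, one per data collector.
        PathMatcher first  = FileSystems.getDefault().getPathMatcher("glob:[a-m]*");
        PathMatcher second = FileSystems.getDefault().getPathMatcher("glob:[n-z]*");

        List<String> files = List.of("alpha.log", "metrics.log", "nightly.log", "zeta.log");
        for (String name : files) {
            Path p = Path.of(name);
            String owner = first.matches(p) ? "collector-1"
                         : second.matches(p) ? "collector-2"
                         : "unmatched";
            System.out.println(name + " -> " + owner);
        }
    }
}
```

One caveat with this approach: names that start outside the chosen ranges (uppercase letters, digits) match neither pattern and are silently skipped, so the ranges must cover the actual distribution of your file names.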

- How should I differentiate the **file name pattern** configuration of the pipeline across the data collectors? If the same pipeline runs on each data collector, then the same configuration applies too, which means the duplicate issue still happens, right? – Tamizharasan Jul 18 '18 at 07:14
- Added my clarification from Ask StreamSets – metadaddy Jul 19 '18 at 19:58