
I want to understand how StreamSets Data Collector works. What happens when a StreamSets pipeline is executed?
Does it use distributed execution with master and worker processes? Which components are responsible for the master and worker roles, and what is inside them? I read the documentation - https://streamsets.com/documentation/controlhub/3.3.2/installhelp/controlhub/InstallationGuide/InstallationOverview/Architecture.html For comparison, Apache Flink uses ActorSystems. I can't find this information - could you help me?

Vadim

1 Answer


StreamSets Data Collector is a single Java application with a web front end. You design a pipeline, and it is saved as JSON. When you run the pipeline, the execution engine (part of that same Java app) loads the JSON representation, reads data into memory from the configured data source, manipulates it in memory according to the processors you have configured, and writes it to one or more destinations.
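
To illustrate, here is a minimal, conceptual Java sketch of that read/process/write loop. The Origin, Processor, and Destination interfaces and every name in it are hypothetical stand-ins, not the actual StreamSets SDK or stage API; the point is only that the whole thing runs inside one JVM process:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {

    // Hypothetical stage interfaces standing in for the origin, processors
    // and destinations that a pipeline's JSON configuration describes.
    interface Origin { List<Map<String, String>> readBatch(); }
    interface Processor { List<Map<String, String>> process(List<Map<String, String>> batch); }
    interface Destination { void write(List<Map<String, String>> batch); }

    public static void main(String[] args) {
        // Origin: pretend we read two records from a source system.
        Origin origin = () -> {
            List<Map<String, String>> batch = new ArrayList<>();
            Map<String, String> r1 = new HashMap<>();
            r1.put("name", "alice");
            batch.add(r1);
            Map<String, String> r2 = new HashMap<>();
            r2.put("name", "bob");
            batch.add(r2);
            return batch;
        };

        // Processor: manipulate each record in memory (upper-case a field).
        Processor upperCase = batch -> {
            for (Map<String, String> record : batch) {
                record.put("name", record.get("name").toUpperCase());
            }
            return batch;
        };

        // Destination: write the transformed batch out (here, just print it).
        Destination stdout = batch -> batch.forEach(System.out::println);

        // The "execution engine": a simple read -> process -> write loop,
        // all within a single Java process.
        stdout.write(upperCase.process(origin.readBatch()));
    }
}
```

So there is no master/worker split inside Data Collector itself in standalone mode; one process does the reading, processing, and writing.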

StreamSets Control Hub provides a centralized web front end where you can again design your pipelines, but in this case, you can have one or more Data Collectors connected to Control Hub and dispatch jobs to Data Collector instances based on your configuration - for example, in Control Hub you can start a job to execute a pipeline on 2 Data Collector instances with the dev label. Control Hub also contains a central, versioned, pipeline repository and allows you to compose topologies comprising multiple pipelines, each feeding the next.
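
As a rough illustration of the label-based dispatch, here is a conceptual Java sketch. It is not Control Hub's actual implementation; the Collector record, the URLs, and the selection logic are invented for the example:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class LabelDispatchSketch {

    // Hypothetical representation of a Data Collector registered with Control Hub.
    record Collector(String url, Set<String> labels) {}

    // Pick up to `instances` collectors whose labels include the job's label:
    // roughly what "run this job on 2 Data Collectors with the dev label" means.
    static List<Collector> selectCollectors(List<Collector> registered, String jobLabel, int instances) {
        return registered.stream()
                .filter(c -> c.labels().contains(jobLabel))
                .limit(instances)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Collector> registered = List.of(
                new Collector("http://sdc1:18630", Set.of("dev")),
                new Collector("http://sdc2:18630", Set.of("dev", "kafka")),
                new Collector("http://sdc3:18630", Set.of("prod")));

        // A "job" asking for the pipeline to run on 2 dev-labelled collectors.
        selectCollectors(registered, "dev", 2)
                .forEach(c -> System.out.println("start pipeline on " + c.url()));
    }
}
```

Each selected Data Collector then runs the pipeline locally, exactly as described above; Control Hub coordinates which engines run which jobs, rather than acting as a master process in the execution itself.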

metadaddy