
Dears,

I am considering how best to use StreamSets in the following generic Data Hub architecture:

  • I have several data types (csv, tsv, json, binary from IoT) that need to be captured by CDC, written to a Kafka topic in their original format, and then sunk to the HDFS Data Lake as-is.
  • Then another StreamSets pipeline will consume from this Kafka topic, convert each record (depending on its data type) into a common JSON format, perform validation, masking, metadata enrichment, etc., and write the result to another Kafka topic.
  • The same JSON messages will also be saved to the HDFS Data Lake in Avro format for batch processing.
  • I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready and can be further enriched with other data for scalable, complex transformation (a rough sketch of this step follows the list).
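
To make the last bullet concrete, here is a minimal PySpark Structured Streaming sketch of what consuming the curated JSON topic could look like. The broker address, topic name, and field names in the schema are placeholder assumptions, not part of the design above.

    # Minimal sketch of the real-time consumption step (assumed names throughout).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("realtime-enrichment").getOrCreate()

    # Assumed schema for the common JSON format produced by the second
    # StreamSets pipeline (placeholder fields).
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("source_type", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
           .option("subscribe", "curated-json")                # placeholder topic
           .load())

    # Kafka delivers the message body as bytes; parse it into columns.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(from_json(col("json"), schema).alias("e"))
                 .select("e.*"))

    # Enrichment / complex transformations would go here, e.g. joining
    # with reference data loaded as a static DataFrame.

    query = (events.writeStream
             .format("console")   # replace with the real sink
             .outputMode("append")
             .start())
    query.awaitTermination()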

So far I have not used StreamSets for this further processing; instead I rely on Spark Streaming for the scalable, complex transformations, which falls outside SLA management (since the Spark jobs are not triggered from within StreamSets). Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema, so the JSON schema is validated by custom JavaScript logic embedded in StreamSets.
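
For context on that validation step, the following is only an illustrative sketch of the same kind of schema check done outside StreamSets, here in Python with the `jsonschema` package; in my design the equivalent logic lives in a StreamSets JavaScript evaluator, and the schema fields below are placeholder assumptions.

    # Illustrative only: an equivalent JSON schema check in Python.
    import json
    from jsonschema import validate, ValidationError

    # Placeholder schema for the common JSON format.
    COMMON_SCHEMA = {
        "type": "object",
        "properties": {
            "event_id": {"type": "string"},
            "source_type": {"type": "string"},
            "payload": {"type": "string"},
        },
        "required": ["event_id", "source_type"],
    }

    def is_valid(message: bytes) -> bool:
        """Return True if the Kafka message body matches the common JSON schema."""
        try:
            validate(instance=json.loads(message), schema=COMMON_SCHEMA)
            return True
        except (ValidationError, json.JSONDecodeError):
            return False

    # Example: a record missing a required field is rejected.
    print(is_valid(b'{"event_id": "42", "source_type": "csv"}'))  # True
    print(is_valid(b'{"event_id": "42"}'))                        # False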

What can be done better in the above design?

Thanks in advance...

Cengiz

1 Answer


Your pipeline design looks good.

However, I would recommend consolidating several of those steps using Striim.

  • Striim has built-in CDC (change data capture) for all the sources you listed, plus databases.
  • It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
  • Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code for it. Everything is done through our simple UI.

You can try it out here:

https://striim.com/instant-download

Full disclosure: I'm a PM at Striim.

capkutay