Hi all,
I am considering how best to use StreamSets in a generic Data Hub architecture:
- I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured by CDC, saved to a Kafka topic in their as-is format, and then written to the HDFS Data Lake as-is.
- Then another StreamSets pipeline will consume from this Kafka topic, convert each message (depending on its data type) into a common JSON format, perform validation, masking, metadata enrichment, etc., and save the result to another Kafka topic.
- The same JSON message will also be saved to the HDFS Data Lake in Avro format for batch processing.
- I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready at that point and can be further enriched with other data in scalable, complex transformations.
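To make the second step concrete, the conversion to a common JSON format could look roughly like the sketch below. It is plain Python for illustration only, not StreamSets code. The envelope fields (`source_type`, `ingested_at`, `record`), the masked field `email`, and the assumption that each CSV/TSV message carries a header line plus one data row are all hypothetical. Binary IoT payloads would need their own decoder and are left out.

```python
import csv
import hashlib
import io
import json
from datetime import datetime, timezone

# Hypothetical set of fields that must be masked before publishing.
MASKED_FIELDS = {"email"}


def parse_raw(payload: str, data_type: str) -> dict:
    """Parse one raw message into a dict, depending on its data type."""
    if data_type == "json":
        return json.loads(payload)
    if data_type in ("csv", "tsv"):
        delimiter = "," if data_type == "csv" else "\t"
        # Assumption: the message contains a header line and one data row.
        reader = csv.DictReader(io.StringIO(payload), delimiter=delimiter)
        return dict(next(reader))
    raise ValueError(f"unsupported data type: {data_type}")


def mask(value) -> str:
    """Replace a sensitive value with its SHA-256 hex digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def normalize(payload: str, data_type: str) -> str:
    """Return the common JSON envelope for one raw message."""
    record = parse_raw(payload, data_type)
    masked = {
        key: mask(value) if key in MASKED_FIELDS else value
        for key, value in record.items()
    }
    envelope = {
        "source_type": data_type,  # metadata: original format
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "record": masked,
    }
    return json.dumps(envelope)


if __name__ == "__main__":
    print(normalize("id,email\n1,a@b.c", "csv"))
    print(normalize('{"id": 3, "email": "x@y.z"}', "json"))
```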
I do not use StreamSets for further processing and rely on Spark Streaming for the scalable complex transformations. That part is not covered by SLA management, because the Spark jobs are not triggered from within StreamSets. Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema, so the JSON schema is validated by custom logic embedded in StreamSets as JavaScript.
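For reference, the custom schema check is conceptually similar to the sketch below. It is written in plain Python rather than the JavaScript actually embedded in StreamSets, and the schema (`id`, `temperature`, `source_type`) is a hypothetical example, not my real one.

```python
import json
from typing import List

# Hypothetical schema: field name -> accepted Python type(s).
# Note: Python's bool is a subclass of int, so it also passes the number check.
EXPECTED_SCHEMA = {
    "id": str,
    "temperature": (int, float),
    "source_type": str,
}


def validate(message: str) -> List[str]:
    """Return a list of validation errors; an empty list means the message is valid."""
    try:
        record = json.loads(message)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors


if __name__ == "__main__":
    print(validate('{"id": "a1", "temperature": 21.5, "source_type": "csv"}'))
    print(validate('{"id": 7}'))
```

Messages with a non-empty error list would be routed to an error topic or error stream instead of the validated Kafka topic.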
What can be done better in the above design?
Thanks in advance...