
SCD Type 2 is usually implemented with ETL, but is it possible to do this with real-time data processing, for example Spark Streaming or KSQL?
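To make the question concrete, the SCD Type 2 semantics I'd like to apply per incoming change event can be sketched in plain Python (the row layout, column names, and event shape here are illustrative assumptions, not from any particular streaming framework):

```python
from datetime import datetime

def apply_scd2_event(dimension, event, ts):
    """SCD Type 2 step for one change event: expire the current row
    for the business key (if any), then append a new current row."""
    key = event["business_key"]
    for row in dimension:
        if row["business_key"] == key and row["is_current"]:
            row["valid_to"] = ts       # close the old version
            row["is_current"] = False
    dimension.append({
        "business_key": key,
        "attributes": event["attributes"],
        "valid_from": ts,
        "valid_to": None,              # open-ended: this is now the current version
        "is_current": True,
    })

# Replaying a CDC stream event by event:
dim = []
apply_scd2_event(dim, {"business_key": 1, "attributes": {"city": "Berlin"}},
                 datetime(2020, 8, 1))
apply_scd2_event(dim, {"business_key": 1, "attributes": {"city": "Munich"}},
                 datetime(2020, 8, 25))
# dim now holds two versions: the Berlin row closed on 2020-08-25,
# and the Munich row marked current.
```

The question is essentially whether this expire-and-append step can run continuously against a stream (e.g. via a streaming upsert/merge), rather than as a periodic ETL batch.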

ant0nk
  • Spark related? Or other? – thebluephantom Aug 25 '20 at 16:42
  • Any streaming solution would fit. – ant0nk Aug 25 '20 at 16:48
  • Not sure Spark would work in the way you likely think. – thebluephantom Aug 25 '20 at 16:52
  • Question is too vague – thebluephantom Aug 25 '20 at 19:05
  • Basically the question is how to build a conventional data warehouse but using an online data source, like CDC streaming to Kafka. Of course it is possible to put the online data into some kind of staging area first and then process it with ETL. But I'd like to know how to do it with stream data processing. – ant0nk Aug 25 '20 at 19:44
  • OK, clearer, but very difficult. – thebluephantom Aug 25 '20 at 20:01
  • Well, I have done this, but we put it into staging tables in a data lake and then processed it in batch using Scala/Spark. If you want to do it in real time, Spark is not handy imho, although it has all the aspects to deal with things. Doing it on the fly is difficult, and what if you have fact and dimension data and process facts before the dimension is created? – thebluephantom Aug 26 '20 at 17:07
  • "What if you have fact and dimension data and process facts before the dimension is created?" - I don't see how this is possible with a reliable source and reliable CDC transport. But even if it is possible, we can create a PK in a bridge table for the same business key whichever way it arrives, from the dimension source or from the fact source, and then use it for whatever comes first, dimension or fact. – ant0nk Aug 28 '20 at 07:38
  • Not everything is reliable, in my experience. In general you're correct, but the exception proves the rule. – thebluephantom Aug 28 '20 at 08:23
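The bridge-table idea raised in the comments, assigning a surrogate key the first time a business key is seen regardless of whether it arrives via the dimension stream or the fact stream, can be sketched as follows (class and method names are illustrative, and a real implementation would need durable, shared state, e.g. a state store or a lookup table):

```python
class KeyBridge:
    """Hands out a surrogate key on first sight of a business key,
    whether it arrives from the dimension source or the fact source."""

    def __init__(self):
        self._keys = {}   # business key -> surrogate key
        self._next = 1    # next surrogate key to assign

    def surrogate_for(self, business_key):
        if business_key not in self._keys:
            self._keys[business_key] = self._next
            self._next += 1
        return self._keys[business_key]

bridge = KeyBridge()
# A fact arrives before its dimension row: both resolve to the same key,
# so the fact can be loaded immediately and the dimension enriched later.
fact_sk = bridge.surrogate_for("cust-42")
dim_sk = bridge.surrogate_for("cust-42")
```

This sidesteps the fact-before-dimension ordering problem: the fact table always gets a stable surrogate key, and the late-arriving dimension row attaches to the same key when it shows up.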

0 Answers