I'm a newbie trying to understand how we might rewrite a batch ETL process as a Google Dataflow pipeline. I've read some of the docs and run a few examples.
I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins, including cartesian and non-key joins (e.g. date-banded).
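To make the idea concrete, here's a rough sketch of the flow I have in mind, in plain Python standing in for Beam transforms (in Dataflow each function would presumably become a ParDo or composite PTransform; all names and data shapes here are made up for illustration):

```python
# Plain-Python sketch of the proposed event-driven flow.
# In Dataflow, each stage below would be a PTransform over a PCollection;
# all entity/field names here are hypothetical.

def business_events():
    # Stand-in for the source PCollection of business events.
    return [{"entity_id": 1, "type": "order_updated"},
            {"entity_id": 2, "type": "customer_changed"}]

def extract(event):
    # Stand-in for extracting source-system rows for one business entity;
    # in Dataflow this might be a ParDo that queries the source system.
    return {"entity_id": event["entity_id"],
            "rows": [("source_row", event["type"])]}

def transform(extracted):
    # Stand-in for a downstream processing stage (joins, derivations, etc.).
    return {"entity_id": extracted["entity_id"],
            "row_count": len(extracted["rows"])}

results = [transform(extract(e)) for e in business_events()]
```

Each incoming event would fan out into its own extract-and-transform path, which is what attracts me to the PCollection model.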
So a couple of questions here:
(1) Is the approach I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world, complex ETL processes using Google Dataflow, only simple scenarios.
Are there any "higher-level" ETL products that would be a better fit? I've been keeping an eye on Spark and Flink for a while.
Our current ETL is moderately complex: there are only about 30 core tables (classic EDW dimensions and facts), but around 1,000 transformation steps, and the source data is complex (roughly 150 Oracle tables).
(2) How would the complex non-key joins be handled?
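For example, here's the kind of date-banded join logic I mean, again in plain Python. My rough understanding is that when one side is small (e.g. a dimension with validity date ranges), it could be broadcast as a side input and each fact row would scan it for the matching band; the table contents and field names below are invented for illustration:

```python
# Sketch of a date-banded (non-equi) join. In Beam, band_join would be
# the body of a DoFn, with dim_rows supplied as a side input.
# All data and field names here are hypothetical.
from datetime import date

dim = [  # (surrogate_key, valid_from, valid_to) -- the "small" side
    (101, date(2015, 1, 1), date(2015, 6, 30)),
    (102, date(2015, 7, 1), date(2015, 12, 31)),
]

facts = [  # (fact_id, event_date) -- the "large" side
    ("f1", date(2015, 3, 15)),
    ("f2", date(2015, 8, 1)),
]

def band_join(fact, dim_rows):
    # Emit (fact_id, surrogate_key) for each dimension row whose
    # validity range contains the fact's event date.
    fact_id, event_date = fact
    for key, lo, hi in dim_rows:
        if lo <= event_date <= hi:
            yield (fact_id, key)

joined = [pair for f in facts for pair in band_join(f, dim)]
```

Is broadcasting the small side like this the idiomatic approach in Dataflow, or is there a better pattern for non-key joins (especially when neither side is small)?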
I'm obviously attracted to Google Dataflow because it is first and foremost an API, and its parallel processing capabilities seem a very good fit (we are being asked to move from overnight batch to incremental processing).
A good worked example of Dataflow for this use case would really push adoption forward!
Thanks, Mike S