11

I recently saw that GCP has a new tool called Data Fusion. It looks like an easier way to create ETL pipelines than Dataflow. Can we assume it is a replacement for Dataflow?

rish0097

3 Answers

10

Data Fusion is not a replacement for Dataflow but rather complementary to it. It enables hybrid integration because it is based on an open-source project called CDAP. It also has additional metadata and lineage features that are not currently available in Dataflow.

Eslam Nawara
7

Cloud Data Fusion is based on CDAP, an open-source pipeline development tool that offers a visual interface for building ETL/ELT pipelines. CDAP supports major Hadoop distributions (MapR, Hortonworks) and clouds (AWS, GCP, Azure). In GCP it uses a Cloud Dataproc cluster to run jobs and comes with many prebuilt connectors for linking sources to sinks. It gives you codeless pipeline development. Data Fusion is also enterprise-ready, providing data lineage and metadata management.

However, Dataflow is a fully managed GCP service based on Apache Beam, which offers a unified programming model for developing pipelines that cover a wide range of data processing patterns, including ETL, batch computation, and continuous computation. The same code can handle batch and real-time processing, and Beam gives you a choice of runners for deploying the pipeline.

abhay
7

Apache Beam (which Dataflow provides the runtime for) is a unified programming model, meaning it is still "programming", that is, writing code. You have a lot of control over the code, so you can write basically whatever you need to tune the data pipelines you create. The "unified" part is about being able to run that code on different runtimes. There are at least four, and Dataflow is just one of them. Check the compatibility matrix; you might be overwhelmed.
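To make the "write the pipeline once, pick the runtime later" idea concrete, here is a toy sketch in plain Python. It is not the Beam API: `PIPELINE`, `run_batch`, and `run_streaming` are hypothetical names invented for illustration, and the two "runners" are just ordinary functions. The only point is that the transform code stays the same while the execution strategy changes.

```python
from typing import Callable, Iterable, Iterator, List

# A transform takes a stream of records and returns a stream of records.
Transform = Callable[[Iterable[str]], Iterator[str]]


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    """Trim whitespace and drop records that are empty after trimming."""
    return (line.strip() for line in lines if line.strip())


def to_upper(lines: Iterable[str]) -> Iterator[str]:
    """Upper-case every record."""
    return (line.upper() for line in lines)


# The pipeline is defined once, independently of how it will be executed.
PIPELINE: List[Transform] = [strip_blank, to_upper]


def apply_pipeline(source: Iterable[str], pipeline: List[Transform]) -> Iterable[str]:
    """Chain the transforms lazily over the source."""
    data: Iterable[str] = source
    for transform in pipeline:
        data = transform(data)
    return data


def run_batch(records: List[str]) -> List[str]:
    """Hypothetical 'batch runner': consumes a bounded list and returns all results."""
    return list(apply_pipeline(records, PIPELINE))


def run_streaming(source: Iterator[str]) -> Iterator[str]:
    """Hypothetical 'streaming runner': yields each result as its input arrives."""
    yield from apply_pipeline(source, PIPELINE)
```

In real Beam code the runner is chosen through pipeline options rather than by calling a different function, but the division of labour is similar: you own the transform code, and the runner owns the execution.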

CDAP (Data Fusion), as it seems, is more about building a data pipeline without coding at all. An API is available in case it is needed, of course, but the goal is to build as much as possible without coding.

CDAP is quite new and not widely known (based on GitHub statistics). There were many similar attempts at codeless integration back in the glory days of ESBs (Enterprise Service Buses), and while many of them were quite successful, overall they didn't catch on as well as many had hoped. Having said that, a lot of people compare Data Fusion to Azure's Data Factory, and the latter seems to be quite popular on Azure, so it may well be that Google Cloud is trying to close that gap.

yuranos