
I am trying to assess the feasibility of replacing hundreds of feed-file ETL jobs built as SSIS packages with Apache Flink jobs (with Kubernetes as the underlying infrastructure). One recommendation I saw in an article is to "use one Flink cluster for one type of job".

Since I only have a handful of jobs per day of each job type, does this mean the best approach for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? Is that the correct way to do it? I am setting up the Flink cluster without a job manager.

Do you have any suggestions on best practices for using Flink for batch ETL activities?

Perhaps the most important question: is Flink the correct solution for this problem statement, or should I look more into Talend and other classic ETL tools?

Vishal
  • Could you please update your question with the frequency that these ETL jobs are executed? Depending on the answers, I'd have different recommendations. – Arvid Heise Aug 19 '20 at 08:22
  • "I have a handful of jobs per day of each job type" and several such job types. The total number of jobs is in the hundreds, mostly related to moving data from one system to another. – Vishal Aug 20 '20 at 03:30

1 Answer


Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:

Session cluster

A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.

Benefits:

  • No additional cluster deployment time needed when submitting jobs => Faster job submissions
  • Better resource utilization if individual jobs don't need many resources
  • One place to control all your jobs

Downsides:

  • No strict isolation between jobs
    • Failures caused by job A can cause job B to restart
    • Job A runs in the same JVM as job B and can therefore influence it if static fields are used
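
For example, with Flink's native Kubernetes integration (available since Flink 1.10), starting a session cluster and submitting a job to it might look like the following sketch. The cluster id, memory settings, and jar are placeholders, not required values:

```shell
# Start a long-running session cluster on Kubernetes
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=etl-session \
  -Dtaskmanager.memory.process.size=4096m \
  -Dtaskmanager.numberOfTaskSlots=4

# Submit an ETL job to the already-running session cluster;
# no cluster deployment happens here, so submission is fast
./bin/flink run \
  --target kubernetes-session \
  -Dkubernetes.cluster-id=etl-session \
  ./examples/batch/WordCount.jar
```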

Per-job cluster

A per-job cluster starts a dedicated Flink cluster for every job.

Benefits

  • Strict job isolation
  • More predictable resource consumption since only a single job runs on the TaskExecutors

Downsides

  • Cluster deployment time is part of the job submission time, resulting in longer submission times
  • Not a single cluster which controls all your jobs
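
On Kubernetes, the closest native equivalent is application mode (Flink 1.11+), where a dedicated cluster is spun up per job and torn down when the job finishes. A rough sketch, with the cluster id, container image, and jar path as placeholders:

```shell
# Deploy a dedicated cluster for a single job; the cluster
# shuts itself down once the job completes
./bin/flink run-application \
  --target kubernetes-application \
  -Dkubernetes.cluster-id=etl-job-example \
  -Dkubernetes.container.image=my-registry/my-etl-job:latest \
  local:///opt/flink/usrlib/my-etl-job.jar
```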

Recommendation

So if you have many short-lived ETL jobs that require a fast response, I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time carries little weight, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.

Till Rohrmann
  • Thank you for the details; that clarifies my doubts. One last question: is there a better-looking, GCP Dataflow-like monitoring UI? The Flink dashboard is great but appears to need more technical skills than a typical job-monitoring skill set. – Vishal Aug 20 '20 at 03:25
  • Flink only comes with the existing web UI. What you can do is to build your own UI using Flink's REST API. I would be interested why you feel that Flink's web UI is harder to use than GCP's UI. – Till Rohrmann Aug 20 '20 at 05:17
  • To understand and debug the execution plan displayed on the Flink dashboard, one needs to be well versed in map/reduce operations and able to correlate them with business steps. Dataflow shows the English text associated with each step, so a basic user can find out which step is failing (rather than which operation is failing). Even interpreting metrics takes more time on the Flink dashboard. The Flink dashboard is a very powerful tool for developers to debug issues. The use case I was talking about involves hundreds or maybe thousands of jobs being monitored 24x7 by average tech users. – Vishal Aug 20 '20 at 11:11
  • This is very helpful feedback @Vishal. Thanks a lot! I cannot promise that we will make swift improvements, but I always have an open ear for ideas on how to improve Flink. – Till Rohrmann Aug 21 '20 at 07:32
  • Also, on a related point: if the clusters are created at run time, the one-dashboard-per-cluster design may need to be revisited. Maybe one could think of creating a dashboard one level up. – Vishal Aug 24 '20 at 06:24
  • This is true. However, so far the decision has been that such functionality is not a core responsibility of Flink. There are other projects/products out there which solve this problem. In a project as big as Flink, one really needs to concentrate on a few things; otherwise it becomes too unwieldy. – Till Rohrmann Aug 24 '20 at 07:14