
I am in the process of setting up a data pipeline for a client. I've spent several years on the analysis side of things, but now I am working with a small shop that only really has a production environment. The first thing we did was create a replicated instance of production, but I would like to apply a data-warehouse mentality to make the analysis portion easier.

My question comes down to: which tool should I use, and why? I have been looking at solutions like Talend for ETL, but I am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python fairly fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.

Any suggestions are much appreciated.

Alex

2 Answers


Please consider that the open-source edition of Talend (Talend Open Studio) does not provide any monitoring or scheduling capabilities; it is only a code generator. The more sophisticated infrastructure is part of the enterprise editions.

Gadi
Which can be compensated for with crontabs/schedulers and logging into a centralized database. It works, it just needs a little more doing. – tobi6 Sep 09 '16 at 09:10
What do you mean by monitoring? If you have 100+ jobs with 50+ steps each, then TAC will not provide any monitoring functionality. Then you have to do what @tobi6 said: log into a centralized database and report out of that (a rough sketch of that pattern follows). – Balazs Gunics Sep 09 '16 at 11:40
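
For what tobi6 and Balazs Gunics describe, one option is a cron-driven wrapper that runs the exported Talend job and records each run in a central table. Below is a minimal Python sketch of that pattern; the job path, database file, and table layout are illustrative assumptions, not anything from the original posts (a standalone Talend Open Studio job exports as a shell script, which is what the wrapper invokes):

    import datetime
    import sqlite3
    import subprocess

    # Hypothetical path to a shell script exported from Talend Open Studio.
    JOB_SCRIPT = "/opt/talend/jobs/load_orders/load_orders_run.sh"
    # Any centralized database works; sqlite keeps the sketch self-contained.
    LOG_DB = "/var/log/etl/job_runs.db"

    def run_and_log():
        started = datetime.datetime.utcnow().isoformat()
        # Run the exported Talend job and capture its output.
        result = subprocess.run([JOB_SCRIPT], capture_output=True, text=True)
        finished = datetime.datetime.utcnow().isoformat()

        # Record the run in a central table so all jobs can be reported on.
        conn = sqlite3.connect(LOG_DB)
        conn.execute(
            """CREATE TABLE IF NOT EXISTS job_runs
               (job TEXT, started TEXT, finished TEXT,
                exit_code INTEGER, stderr TEXT)"""
        )
        conn.execute(
            "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?)",
            ("load_orders", started, finished,
             result.returncode, result.stderr[-2000:]),
        )
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        run_and_log()

A crontab entry along the lines of 0 2 * * * /usr/bin/python3 /opt/etl/run_and_log.py then provides the scheduling, and "monitoring" becomes a query against the job_runs table.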

For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations. A minimal sketch of how those pieces fit together is below.
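
To make that concrete, here is a minimal, illustrative Airflow DAG in the spirit of that setup: Fivetran/Stitch load the warehouse on their own schedules, and Airflow runs the dbt models (and tests) on top of the loaded data. The DAG id, schedule, and project path are assumptions, not details from the original answer:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="warehouse_transformations",
        start_date=datetime(2024, 1, 1),
        # Run after the overnight Fivetran/Stitch loads have landed.
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Fivetran/Stitch handle extraction and loading separately, so this
        # DAG only runs the dbt transformations against the warehouse.
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/analytics",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="dbt test --project-dir /opt/dbt/analytics",
        )
        dbt_run >> dbt_test

Keeping transformations in dbt and using Airflow purely as the scheduler also gives you the monitoring asked about in the question for free, via Airflow's task logs and UI.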

Alex