Why do we need TFX if we have Airflow for orchestration?

Question

I still don't get why we need TFX. TFX converts your defined pipeline into an Airflow DAG and runs it on Airflow. Couldn't I just write my pipelines in Python and use Airflow's PythonOperator to build a pipeline directly? Why bother learning another wrapper on top of it? What else does TFX offer that cannot be done with just Airflow + TF + Spark/Beam?

score 0 · Answer 1 · answered Aug 29 '22 at 07:32

I could just write my pipelines in python and use Airflow's PythonOperator to build a pipeline directly right?

You can! Depending on how you define a pipeline of course.

Here is the definition of TFX, from its guide:

"TFX is a Google-production-scale machine learning (ML) platform based on TensorFlow. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system."

And to make a Production ML System

according to engineers at TensorFlow.

So, if you can define a whole system that covers all of these steps in Airflow DAGs, then sure, you don't need TFX.
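To make the "do it yourself" route concrete, here is a minimal sketch of a hand-rolled pipeline in plain Python. The step names (`ingest`, `validate`, `train`, `evaluate`), the toy data, and the toy model are all assumptions for illustration and do not come from TFX or Airflow. Each function could serve as the callable of one Airflow PythonOperator task, but passing results between steps, validating data, and tracking outputs remain your responsibility.

```python
# Hypothetical hand-rolled ML pipeline: each function could be the
# python_callable of one Airflow PythonOperator task.
import statistics


def ingest():
    # Stand-in for reading training data from storage (made-up rows).
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]


def validate(rows):
    # Minimal data check; real validation would also cover schema and drift.
    if not rows or any(len(r) != 2 for r in rows):
        raise ValueError("malformed training rows")
    return rows


def train(rows):
    # Least-squares slope through the origin as a toy "model".
    num = sum(x * y for x, y in rows)
    den = sum(x * x for x, _ in rows)
    return {"slope": num / den}


def evaluate(model, rows):
    # Mean absolute error on the same rows, for illustration only.
    return statistics.mean(abs(y - model["slope"] * x) for x, y in rows)


def run_pipeline():
    # The glue between steps is written by hand here.
    rows = validate(ingest())
    model = train(rows)
    return model, evaluate(model, rows)


model, mae = run_pipeline()
```

This works fine for a small system. The open question is how much of the surrounding machinery (validation, caching, metadata, deployment checks) you are willing to write and maintain yourself.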

P.S. :

It comes down to the problem you are trying to solve. Here are some questions to think about.

Do you have the data you need at hand, and is it valuable?
Do you need to adjust it before giving it to a model?
Which model should you use?
Are you going to re-train the model as you get new data? If so, how often should this process run?
As you are doing inference - or serving your model - how are you going to use the predicted results?
What is your threshold for evaluating the success of your service? What metrics should you use?

To learn more, you can check here.

score 0 · Answer 2 · answered Oct 26 '22 at 09:05

TFX is a Google-production-scale machine learning (ML) platform and provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system. For more details, please refer to the official documentation.

Advantages of using TFX:

TFX is designed to be portable to multiple environments and orchestration frameworks, including Apache Airflow, Apache Beam, and Kubeflow Pipelines.
TFX provides a set of standard components and libraries, along with the base functionality for many of those components, which helps when implementing an ML pipeline.
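The portability point can be sketched in plain Python: the pipeline is described once, and different "runners" decide how to execute it. This is not TFX code. `PIPELINE`, `run_locally`, and `compile_to_task_edges` are hypothetical names, and the component names only echo the kind of components TFX ships.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline definition: component name -> upstream components.
PIPELINE = {
    "example_gen": set(),
    "statistics_gen": {"example_gen"},
    "trainer": {"example_gen"},
    "evaluator": {"example_gen", "trainer"},
}


def run_locally(pipeline, executors):
    """Runner 1: execute components in-process, in dependency order."""
    order = list(TopologicalSorter(pipeline).static_order())
    for name in order:
        executors[name]()
    return order


def compile_to_task_edges(pipeline):
    """Runner 2: emit (upstream, downstream) edges that a task-based
    orchestrator could turn into its own DAG."""
    return sorted((up, down) for down, ups in pipeline.items() for up in ups)


log = []
# Each executor just records that it ran; real components would do work.
executors = {name: (lambda n=name: log.append(n)) for name in PIPELINE}

local_order = run_locally(PIPELINE, executors)
edges = compile_to_task_edges(PIPELINE)
```

The same `PIPELINE` description feeds both runners, which is roughly the idea behind defining a pipeline once and choosing the orchestrator separately.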

score 0 · Answer 3 · answered Oct 31 '22 at 10:13

tl;dr Airflow is not an effective scheduler for Continuous Training Pipelines (CTP). TFX uses it as a fallback mechanism for engineers who do not need the full power of CTP.

TFX shines when it comes to CTP. Such pipelines need data-driven execution of their components, in order to:

operate asynchronously at different iteration intervals
reuse results from previous runs
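As a rough illustration of "reuse results from previous runs", here is a minimal sketch of input-fingerprint caching. It is an assumption about how such reuse can work in general, not TFX's actual caching code. `run_cached`, `fingerprint`, and the in-memory `_cache` are hypothetical names.

```python
import hashlib
import json

# Hypothetical in-memory cache: fingerprint of (component, inputs) -> output.
_cache = {}
calls = {"count": 0}  # counts real executions, to show when work is skipped


def fingerprint(component, inputs):
    """Stable hash of the component name plus its JSON-serializable inputs."""
    payload = json.dumps({"component": component, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(component, inputs, fn):
    """Run fn(inputs) unless this component already processed these inputs."""
    key = fingerprint(component, inputs)
    if key not in _cache:
        calls["count"] += 1
        _cache[key] = fn(inputs)
    return _cache[key]


def compute_stats(inputs):
    values = inputs["values"]
    return {"mean": sum(values) / len(values), "n": len(values)}


first = run_cached("stats", {"values": [1, 2, 3]}, compute_stats)
second = run_cached("stats", {"values": [1, 2, 3]}, compute_stats)    # cache hit
third = run_cached("stats", {"values": [1, 2, 3, 4]}, compute_stats)  # new data
```

With plain scheduled tasks you would have to build and maintain this bookkeeping yourself for every step.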

CTP cannot be implemented effectively as the repeated execution of one-off pipelines at scheduled intervals, which is the task-based model that orchestrators such as Airflow use.

Developers who are used to seeing jobs execute in sequence, as they were defined in a directed acyclic graph (DAG), are not accustomed to runs being triggered by the presence of a specific configuration of artifacts, as represented by the pipeline state. As a solution, TFX introduces a framework that allows users to specify job dependency as they would in a task-based orchestration system. This also allows users of the open source version of TFX to orchestrate their TFX pipelines with task-based orchestration systems like Apache Airflow.

Reference for gray text: Continuous Training for Production ML in the TensorFlow Extended (TFX) Platform
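The difference between task-based and data-driven execution can be sketched in a few lines. This is a toy model under my own assumptions, not how TFX implements it: `store`, `ingest`, `maybe_train`, and the threshold of three new spans are all hypothetical. Ingestion runs on every tick, while training fires only when enough new example artifacts are present, so the two components end up running at different intervals.

```python
# Hypothetical artifact store: artifact type -> list of artifact ids.
store = {"examples": [], "model": []}
consumed = set()  # example artifacts the trainer has already used


def ingest(span):
    """Runs often (say, hourly): publishes one new 'examples' artifact."""
    store["examples"].append(f"examples-span-{span}")


def maybe_train(min_new_spans=3):
    """Runs only when enough unconsumed example artifacts are present."""
    fresh = [a for a in store["examples"] if a not in consumed]
    if len(fresh) < min_new_spans:
        return None  # trigger condition not met; nothing to do
    consumed.update(fresh)
    model_id = f"model-{len(store['model']) + 1}"
    store["model"].append(model_id)
    return model_id


results = []
for span in range(1, 8):
    ingest(span)
    results.append(maybe_train())
```

Here the trainer's runs are decided by the state of the artifacts rather than by its position in a schedule, which is the behavior the quoted passage describes.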

CTP is achieved by the use of TFX components, which are released via the Apache license. — Theofilos Papapanagiotou, Feb 12 '23 at 01:17

score -1 · Answer 4 · answered Jun 24 '23 at 15:48

While it is true that you can use Airflow's PythonOperator to build and orchestrate your machine learning pipelines directly, TFX offers additional benefits and capabilities that can enhance the development and deployment of ML pipelines. Here are some reasons why you might consider using TFX:

End-to-End ML Platform: TFX is designed as an end-to-end machine learning platform that covers the entire ML lifecycle, from data ingestion to model deployment and monitoring. It provides a standardized and integrated set of components and workflows for ML pipeline development, including data preprocessing, model training, model validation, and model serving.

Standardized Components: TFX provides pre-built and standardized components specifically designed for ML tasks. These components, such as data validation, feature engineering, and model analysis, offer consistent and reusable building blocks for ML pipelines. They help promote best practices, reduce development time, and ensure consistency across different ML projects.

Integration with TensorFlow Ecosystem: TFX seamlessly integrates with the TensorFlow ecosystem, including TensorFlow for model training and serving, TensorFlow Data Validation (TFDV) for data analysis, and TensorFlow Model Analysis (TFMA) for model evaluation. This tight integration enables efficient data processing, scalable training, and easy integration with other TensorFlow-based tools.

Data Validation and Schema Inference: TFX provides capabilities for data validation and schema inference using TFDV. It helps identify data anomalies, missing values, and schema deviations early in the pipeline. Having a well-defined schema and performing data validation ensures data consistency and helps catch potential issues before training models.
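To show what schema inference and validation mean in practice, here is a minimal plain-Python sketch. It does not use the TFDV API. `infer_schema`, `find_anomalies`, and the sample rows are hypothetical and only illustrate the idea of learning a schema from reference data and checking new data against it.

```python
def infer_schema(rows):
    """Record each column's Python type as first seen in the reference data."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:
                schema.setdefault(column, type(value))
    return schema


def find_anomalies(rows, schema):
    """Return readable anomalies for rows that deviate from the schema."""
    anomalies = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            value = row.get(column)
            if value is None:
                anomalies.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected):
                anomalies.append(
                    f"row {i}: '{column}' expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        for column in sorted(row.keys() - schema.keys()):
            anomalies.append(f"row {i}: unexpected column '{column}'")
    return anomalies


# Made-up reference (training) rows and new (serving) rows.
training = [{"age": 34, "country": "NL"}, {"age": 51, "country": "US"}]
serving = [
    {"age": "34", "country": "NL"},              # wrong type
    {"age": 29},                                 # missing column
    {"age": 40, "country": "DE", "zip": "1011"}, # extra column
]

schema = infer_schema(training)
anomalies = find_anomalies(serving, schema)
```

A real validator also looks at value distributions and drift over time, which is where a dedicated library starts to pay off.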

Model Versioning and Deployment: TFX facilitates model versioning and deployment. It helps manage different versions of trained models, tracks metadata, and supports model serving through platforms like TensorFlow Serving or Kubeflow. TFX also enables monitoring and managing the lifecycle of deployed models, including performance tracking and model updates.
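A minimal sketch of versioning with a promotion gate, in plain Python. This is an assumption about the general pattern, not TFX's metadata store or its deployment components. `ModelRegistry`, its methods, and the accuracy figures and dates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRegistry:
    """Hypothetical registry: keeps every version, serves only approved ones."""
    versions: list = field(default_factory=list)
    serving_version: Optional[int] = None

    def register(self, accuracy, data_span):
        """Store a trained model's metadata and return its version number."""
        version = len(self.versions) + 1
        self.versions.append(
            {"version": version, "accuracy": accuracy, "data_span": data_span}
        )
        return version

    def promote_if_better(self, version, min_gain=0.0):
        """Serve the candidate only if it is at least as good as the current model."""
        candidate = self.versions[version - 1]
        if self.serving_version is not None:
            current = self.versions[self.serving_version - 1]
            if candidate["accuracy"] < current["accuracy"] + min_gain:
                return False
        self.serving_version = version
        return True


registry = ModelRegistry()
v1 = registry.register(accuracy=0.81, data_span="2023-06-01")
promoted_v1 = registry.promote_if_better(v1)  # first model is always served
v2 = registry.register(accuracy=0.79, data_span="2023-06-08")
promoted_v2 = registry.promote_if_better(v2)  # worse than v1, stays unserved
v3 = registry.register(accuracy=0.85, data_span="2023-06-15")
promoted_v3 = registry.promote_if_better(v3)  # better than v1, replaces it
```

Every version is kept with its metadata, so a rejected model can still be inspected later.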

Scalability and Portability: TFX leverages distributed processing frameworks like Apache Beam for scalable data processing. It enables running ML pipelines on different execution environments, such as local machines, cloud clusters, or on-premises infrastructure. TFX promotes portability and scalability of ML pipelines.

While Airflow, TF, Spark, and Beam are powerful tools on their own, TFX complements them by providing an integrated and standardized ML platform with additional features and components specifically designed for ML pipeline development and deployment. It simplifies the process, promotes best practices, and offers a cohesive environment for end-to-end ML development.

Please do not post AI-generated output here: it is forbidden on Stack Overflow. You'd better delete this before you get yourself into trouble: we take plagiarism seriously here. — tchrist, Jun 27 '23 at 23:15
