3

I would like to use dbt in an MWAA (Amazon Managed Workflows for Apache Airflow) environment. To achieve this I need to install dbt in the managed environment and run the dbt commands from there via Airflow operators or the CLI (BashOperator).

My problem with this solution is that I need to store the dbt profile file(s) - which contain the target/source database credentials - in S3. Otherwise the files are not deployed to the Airflow worker nodes and therefore cannot be used by dbt.

Is there any other option? I feel this is a big security risk, and it also undermines the use of Airflow (because I would like to use its built-in password manager).

My ideas:

  • Create the profile file on the fly in the Airflow DAG as a task and write it out locally. I do not think this is a feasible workaround, because there is no guarantee that the dbt task will run on the same worker node on which my code created the file.
  • Move the profile file manually to S3 (exclude it from CI/CD). Again, I see a security risk, as I am storing credentials on S3.
  • Create a custom operator which builds the profile file on the same machine the command will run on. A maintenance nightmare.
  • Use MWAA environment variables (https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-env-variables.html) and combine them with dbt's env_var function (https://docs.getdbt.com/reference/dbt-jinja-functions/env_var). Storing credentials in system-wide environment variables this way feels awkward.

Any good ideas or best practices?

PeterRing

2 Answers

1

@PeterRing, in our case we use dbt Cloud. Once the connection is set up in the Airflow UI, you call dbt Cloud job IDs to trigger the jobs (and then use a sensor to monitor each run until it completes).
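
If the dbt Cloud provider package (apache-airflow-providers-dbt-cloud) is available in your MWAA requirements.txt, the trigger-then-sense pattern looks roughly like the sketch below; the DAG ID, connection ID and job ID are placeholders, not values from this answer:

from datetime import datetime

from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.dbt.cloud.sensors.dbt import DbtCloudJobRunSensor

with DAG(
    dag_id="dbt_cloud_example",          # placeholder DAG ID
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Trigger the dbt Cloud job by its ID without blocking a worker slot.
    trigger_dbt_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_job",
        dbt_cloud_conn_id="dbt_cloud_default",  # connection set up in the Airflow UI
        job_id=12345,                           # placeholder dbt Cloud job ID
        wait_for_termination=False,
    )

    # Poll the run (the operator returns the run ID) until it completes.
    wait_for_dbt_job = DbtCloudJobRunSensor(
        task_id="wait_for_dbt_job",
        dbt_cloud_conn_id="dbt_cloud_default",
        run_id=trigger_dbt_job.output,
        timeout=3600,
    )

    trigger_dbt_job >> wait_for_dbt_job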

If you can't use dbt Cloud, perhaps you can use AWS Secrets Manager to store your db profile/credentials: Configuring an Apache Airflow connection using a Secrets Manager secret
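
With the Secrets Manager backend enabled through MWAA's Airflow configuration options (secrets.backend and secrets.backend_kwargs, as described in the linked guide), a DAG can resolve credentials like any other Airflow connection instead of reading them from S3. A minimal sketch using Airflow 2.x import paths; the connection name and secret name are hypothetical:

from airflow.hooks.base import BaseHook

# Assumes a secret named "airflow/connections/redshift_dwh" exists in Secrets
# Manager and that the backend is configured with
# connections_prefix = "airflow/connections".
conn = BaseHook.get_connection("redshift_dwh")
print(conn.host, conn.login)  # the password stays in Secrets Manager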

Coffee and Code
0

We ran into this problem as well. We solved it with a simple custom Airflow dbt operator, in combination with MWAA's Secrets Manager backend and environment variables, as Coffee and Code proposed.

Within our dbt project, we push a profiles.yml to S3, too. It looks like this:

dwh:
  outputs:
    prd:
      dbname: dwh
      host: "{{ env_var('DBT_HOST') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      schema: "{{ env_var('DBT_SCHEMA', 'dbt') }}"
      threads: 1
      type: redshift
      user: "{{ env_var('DBT_USER') }}"
  target: prd

Then, within Airflow, we created a custom dbt operator. Whenever this operator is initialized, it fetches the profile information from AWS Secrets Manager like this:

dbt_profile = json.loads(Variable.get("dbt-profile-env"))
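
For context, MWAA's Secrets Manager backend maps Airflow Variables to secrets under the configured variables prefix, so the Variable.get("dbt-profile-env") call above resolves a secret such as airflow/variables/dbt-profile-env. A one-off sketch of creating that secret with boto3; all values are placeholders and the prefix is an assumption based on the linked AWS guide:

import json

import boto3

# Creates the secret that the Secrets Manager backend exposes to Airflow as
# Variable.get("dbt-profile-env"), assuming variables_prefix = "airflow/variables".
secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="airflow/variables/dbt-profile-env",
    SecretString=json.dumps(
        {
            "host": "my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com",  # placeholder
            "user": "dbt_user",      # placeholder
            "password": "********",  # placeholder, never hard-code real credentials
            "schema": "dbt",
        }
    ),
)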

The operator is basically a dbt wrapper around the standard BashOperator. Before it runs, it populates the profiles.yml like this:

bash_command=f"""
            cp -R /usr/local/airflow/dags/dbt /tmp;\
            export DBT_SCHEMA={dbt_profile['schema']};\
            export DBT_HOST={dbt_profile['host']};\
            export DBT_USER={dbt_profile['user']};\
            export DBT_PASSWORD={dbt_profile['password']};\
            /usr/local/airflow/.local/bin/dbt {command} --project-dir /tmp/dbt --profiles-dir /tmp/dbt;\
            cat /tmp/logs/dbt.log
        """

Note that the command is based on AWS's official MWAA User Guide, and that you specifically point dbt to the directory where the (now populated) profiles.yml is stored.
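
For completeness, a minimal sketch of what such a wrapper operator could look like, using Airflow 2.x import paths; the class name is illustrative and this is not the author's exact implementation:

import json

from airflow.models import Variable
from airflow.operators.bash import BashOperator


class DbtOperator(BashOperator):
    """Runs a dbt command with credentials pulled from the Secrets Manager-backed Variable."""

    def __init__(self, command: str, **kwargs) -> None:
        # With the Secrets Manager backend enabled, this reads the secret
        # instead of the Airflow metadata database.
        dbt_profile = json.loads(Variable.get("dbt-profile-env"))
        # Copy the dbt project to /tmp, export the env vars referenced by
        # profiles.yml, then run the requested dbt command.
        bash_command = f"""
            cp -R /usr/local/airflow/dags/dbt /tmp;\
            export DBT_SCHEMA={dbt_profile['schema']};\
            export DBT_HOST={dbt_profile['host']};\
            export DBT_USER={dbt_profile['user']};\
            export DBT_PASSWORD={dbt_profile['password']};\
            /usr/local/airflow/.local/bin/dbt {command} --project-dir /tmp/dbt --profiles-dir /tmp/dbt;\
            cat /tmp/logs/dbt.log
        """
        super().__init__(bash_command=bash_command, **kwargs)


# Usage in a DAG:
# dbt_run = DbtOperator(task_id="dbt_run", command="run")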

Bas Witte