
While executing the following Python script with Cloud Composer, I get `*** Task instance did not exist in the DB` in the Airflow log for the `gcs2bq` task. Code:

import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator

print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())
default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    #models.Variable.get('gcp_project')
}
try:
  # [START composer_quickstart_schedule]
  with models.DAG(
        'composer_agg_quickstart',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    # [END composer_quickstart_schedule]
      op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
      #op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)
      op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
          task_id='gcs2bq',
          bucket='dr-mockup-data',
          source_objects=['sample.csv'],
          destination_project_dataset_table='data-rubrics.sample_bqtable',
          schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                         {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
          write_disposition='WRITE_TRUNCATE',
          dag=dag)
      #op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)
      op_start >> op_load
except Exception:
  # except clause closes the try block above; re-raise so DAG import errors still surface
  raise
Gaurav Taneja

2 Answers


UPDATE:

Can you remove `dag=dag` from the `gcs2bq` task, since you are already inside a `with models.DAG(...) as dag:` block, and run your DAG again?
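For reference, a minimal sketch of how that task could look without `dag=dag`, reusing the `gcs_to_bq` import and the bucket/table values from the question; inside the `with models.DAG(...) as dag:` block the operator is attached to the DAG automatically:

op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs2bq',
    bucket='dr-mockup-data',
    source_objects=['sample.csv'],
    destination_project_dataset_table='data-rubrics.sample_bqtable',
    schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                   {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
    write_disposition='WRITE_TRUNCATE')  # no dag=dag; the enclosing with block sets it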


It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date

From the FAQ: "We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along."

Make your start_date static or use Airflow utils/macros:

import airflow
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
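Alternatively, a static start_date can be spelled out directly; the exact date below is only an illustrative placeholder:

import datetime

args = {
    'owner': 'airflow',
    # any fixed date in the past works; 2018-11-01 is just a placeholder
    'start_date': datetime.datetime(2018, 11, 1),
}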
kaxil
  • Understood, but the `op_start` task (which shares the same config and hence the same start date) runs fine, yet for `gcs2bq` no Task Instance is generated at all – Gaurav Taneja Nov 19 '18 at 11:02
  • I have updated the answer. Can you remove `dag=dag` from the `gcs2bq` task, since you are already using `with models.DAG`, and run your DAG again? – kaxil Nov 19 '18 at 11:22
  • Removed `dag=dag` for `gcs2bq` but nothing changed – Gaurav Taneja Nov 19 '18 at 11:49

Okay, this was a stupid question on my part, and apologies to everyone who wasted time here. I had another DAG running, because of which the one I was triggering was always stuck in the queue. Also, I did not write the correct value in `destination_project_dataset_table`. Thanks to everyone who spent time on this.
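For anyone hitting the same thing: `destination_project_dataset_table` expects a `(project.)dataset.table` value, so the fix looks roughly like the line below (the dataset name here is hypothetical):

destination_project_dataset_table='data-rubrics.sample_dataset.sample_bqtable'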

Gaurav Taneja