
I'm new to Google Cloud Composer and have run into what seems to be a strange issue in a DAG I've created. I have a process which takes a tar.gz file from Cloud Storage, rezips it as a .gz file, and then loads the .gz file into BigQuery. Yesterday I tried to add a new step to the process: an insert from the created "shard" into a new table.
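(For context, the convert step in the chain below does roughly the following. This is only a minimal sketch, assuming the archive holds a single data file; the function and variable names are illustrative, not my actual code:)

    import gzip
    import tarfile

    # Illustrative sketch only: repack the single member of a tar.gz
    # archive as a plain .gz file.
    def tar_gz_to_gz(tar_path, gz_path):
        with tarfile.open(tar_path, 'r:gz') as tar:
            member = tar.getmembers()[0]      # assumes one data file inside
            with gzip.open(gz_path, 'wb') as out:
                out.write(tar.extractfile(member).read())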

I couldn't get this to work until I changed the order of the steps in my DAG. In my DAG, I have a step called "delete_tar_gz_files_op". When this was executed prior to "insert_daily_file_into_nli_table_op", the insert never ran (no failure in Composer; it just seemed not to run at all). When I swap the order of these two steps, with no other changes to the code, the insert works as expected. Does anyone know what might cause this? I have no idea why it would happen, as the two steps aren't related at all: one runs an insert query from one BigQuery table to another, and the other deletes a tar.gz file in Cloud Storage.

My current DAG execution order, which works:

initialize >> FilesToProcess >> download_file >> convert_task >> upload_task >> gcs_to_bq >> archive_files_op >> insert_daily_file_into_nli_table_op >> delete_tar_gz_files_op
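For comparison, the ordering that did not work (the insert silently never ran) swapped the last two tasks:

initialize >> FilesToProcess >> download_file >> convert_task >> upload_task >> gcs_to_bq >> archive_files_op >> delete_tar_gz_files_op >> insert_daily_file_into_nli_table_op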

Some of the code used:

    # The GCS-to-BigQuery operator loads the contents of the .gz file
    # into a BigQuery table.
    gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id='load_basket_data_into_big_query' + job_desc,
        bucket="my-processing-bucket",
        bigquery_conn_id='bigquery_default',
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_TRUNCATE',
        compression='GZIP',
        source_objects=['gzip/myzip_' + process_date + '.gz'],
        destination_project_dataset_table='project.dataset.basket_' + clean_process_date,
        field_delimiter='|',
        skip_leading_rows=0,
        google_cloud_storage_conn_id="bigquery_default",
        schema_object="schema.json",
        dag=dag
    )

    # The created shard is then inserted into basket_raw_nli.basket_nli,
    # a partitioned table which contains only the NLI subtype.
    insert_daily_file_into_nli_table_op = bigquery_operator.BigQueryOperator(
        task_id='insert_daily_file_into_nli_table_op_' + job_desc,
        bql=bqQuery,
        use_legacy_sql=False,
        bigquery_conn_id='bigquery_default',
        write_disposition='WRITE_APPEND',
        allow_large_results=True,
        destination_dataset_table=False,
        dag=dag)
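bqQuery itself isn't shown; based on the description above, it is presumably a DML insert along these lines (the subtype column and filter are purely illustrative guesses, not the actual query):

    bqQuery = '''
        INSERT INTO `project.basket_raw_nli.basket_nli`
        SELECT *
        FROM `project.dataset.basket_{}`
        WHERE subtype = 'NLI'  -- hypothetical filter for the NLI subtype
    '''.format(clean_process_date)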

    # The tar file created can now be deleted from the raw folder.
    delete_tar_gz_files_op = python_operator.PythonOperator(
        task_id='delete_tar_gz_files_' + job_desc,
        python_callable=delete_tar_gz_files,
        op_args=[file, process_date],
        provide_context=False,
        dag=dag)

    def delete_tar_gz_files(file, process_date):
        execution_command = 'gsutil rm ' + source_dir + '/' + file
        print(execution_command)
        returncode = os.system(execution_command)
        if returncode != 0:
            # logging.error("Halting process...")
            exit(1)
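For what it's worth, exit(1) raises SystemExit, which bypasses Airflow's normal Exception handling; raising an exception such as AirflowException is the more idiomatic way to fail a task. A sketch of an equivalent callable under that approach (assuming source_dir is defined at module level, as above):

    import subprocess

    from airflow.exceptions import AirflowException

    def delete_tar_gz_files(file, process_date):
        # Same gsutil call, but run via subprocess, and a non-zero exit
        # code fails the task by raising instead of exiting the interpreter.
        execution_command = ['gsutil', 'rm', source_dir + '/' + file]
        print(' '.join(execution_command))
        if subprocess.call(execution_command) != 0:
            raise AirflowException('gsutil rm failed for ' + file)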

Manual run status: (screenshot omitted; all runs show a status of "successful")

Comments:
  • Apologies, I've edited my post to include the code for this function as well as an image of the status. The status of all of the runs was "successful". – Majobber Aug 31 '18 at 10:53
  • I think it would also be interesting to see the DAG graph – tobi6 Aug 31 '18 at 11:02

0 Answers