3

I have my Cloud Composer environment created under project-A and I want to load data into a BigQuery table in project-B. I know about the GCSToBigQueryOperator task, but it is failing rather than succeeding, and I want to know how I can achieve this. From project A I want to run a task which loads data into a project B table.

user3065757

2 Answers

2

Based on my experience and my assumptions about your setup, I think you need to make sure that your service account (the one behind bigquery_conn_id and google_cloud_storage_conn_id) has sufficient permissions in both projects.
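
Once the account has the right roles in both projects, the operator mostly needs a fully qualified destination table. A rough sketch only (the bucket, dataset and table names are placeholders, and the import path and connection-id parameters vary between Airflow/provider versions, with newer releases consolidating them into a single gcp_conn_id):

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

load_into_project_b = GCSToBigQueryOperator(
    task_id="load_into_project_b",
    bucket="bucket_name",  # placeholder: bucket in project A
    source_objects=["file_name.csv"],
    destination_project_dataset_table="project-b.dataset_id.table_id",  # note the explicit project B prefix
    skip_leading_rows=1,
    field_delimiter=",",
    write_disposition="WRITE_TRUNCATE",
    bigquery_conn_id="google_cloud_default",  # both connections must resolve to a service account
    google_cloud_storage_conn_id="google_cloud_default",  # that has permissions in both projects
    dag=dag,
)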

shankshera
0

As shankshera mentioned, first check in GCP IAM that the service account you use in your Cloud Composer environment has access to both projects (and to the datasets in BigQuery).
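
If you would rather grant access at the dataset level in project B than assign project-wide IAM roles, one possible way (a sketch only; the project, dataset and service-account names below are placeholders) is to append the Composer service account to the dataset's access entries with the BigQuery Python client:

from google.cloud import bigquery

# Placeholders: dataset in project B and the Composer service account from project A.
client = bigquery.Client(project="project-b")
dataset = client.get_dataset("project-b.dataset_id")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="userByEmail",
        entity_id="composer-sa@project-a.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # persist the updated access list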

To be honest, I also couldn't make this operator work properly, so I wrote a custom Python function to do the same thing.

from google.cloud import bigquery


def load_into_table_from_csv(**kwargs):
    """
    Loads data into the specified BQ table from the specified CSV file in GCS.

    Receives the table_path and file_path parameters from the PythonOperator in Airflow.
    Parameters need to be explicitly specified in the op_kwargs variable in the task definition.

    Example of op_kwargs for the PythonOperator:
        {'table_path': 'project_id.dataset_id.table_id',
         'file_path': 'gs://bucket_name/file_name.csv',
         'delimiter': ',',
         'quote_character': '"'}
    """
    bigquery_client = bigquery.Client()
    table_path = kwargs['table_path']  # fully qualified table ID, can point at another project

    # file_path may be a plain URI string or a string containing a Python list of URIs
    try:
        file = eval(kwargs['file_path'])
    except (SyntaxError, NameError):
        file = kwargs['file_path']

    delimiter = kwargs['delimiter']
    quote_character = kwargs['quote_character']

    job_config = bigquery.LoadJobConfig()
    job_config.field_delimiter = delimiter  # delimiter in the source file
    job_config.skip_leading_rows = 1  # how many rows to skip (set to 1 if you have a header row)
    job_config.quote_character = quote_character
    job_config.write_disposition = 'WRITE_TRUNCATE'  # https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad.FIELDS.write_disposition

    load_job = bigquery_client.load_table_from_uri(
        file,
        table_path,
        job_config=job_config)

    assert load_job.job_type == 'load'

    load_job.result()  # Waits for the table load to complete.

    assert load_job.state == 'DONE'

Then in the DAG itself you can just use this function and provide parameters for it like so:

t8 = PythonOperator(
    task_id=f"load_{table_name}",
    python_callable=load_into_table_from_csv, #function that's called by the task
    op_kwargs=specs_current_table, #passing arguments into a function
    dag=dag
)
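
For completeness, specs_current_table is just the dictionary of keyword arguments described in the function's docstring. A hypothetical example pointing at a table in the other project could look like this (table_name here only exists to build the task_id above):

table_name = "table_id"  # placeholder
specs_current_table = {
    'table_path': 'project-b.dataset_id.table_id',  # fully qualified table in project B
    'file_path': 'gs://bucket_name/file_name.csv',  # source CSV in GCS
    'delimiter': ',',
    'quote_character': '"',
}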

As a side note, I personally agree with the author of this article https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753 and think we should be careful with using too many custom operators if we can do the same thing with vanilla code.