
I am trying to load data from a CSV file in GCS into a Postgres (Cloud SQL) database, but there is no predefined operator in Airflow that does this.

I built a simple operator using a PostgreSQL hook and a GCS file reader, but I'm wondering if there is a better solution, because right now the custom operator works by looping over the open GCS file row by row and running a series of "INSERT INTO" statements.
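
For context, the operator currently does roughly the following (a simplified sketch, not my exact code; the class name, table, and connection IDs are placeholders):

```python
import csv
import io

from airflow.models import BaseOperator
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class GcsCsvToPostgresOperator(BaseOperator):
    """Sketch: download a CSV from GCS and INSERT it row by row."""

    def __init__(self, bucket, object_name, table,
                 postgres_conn_id, gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.bucket = bucket
        self.object_name = object_name
        self.table = table
        self.postgres_conn_id = postgres_conn_id
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        gcs_hook = GCSHook(gcp_conn_id=self.gcp_conn_id)
        pg_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)

        # Download the object into memory and parse it as CSV
        data = gcs_hook.download(
            bucket_name=self.bucket, object_name=self.object_name
        ).decode("utf-8")

        for row in csv.reader(io.StringIO(data)):
            # One INSERT statement per row -- this is the slow part
            placeholders = ", ".join(["%s"] * len(row))
            pg_hook.run(
                f"INSERT INTO {self.table} VALUES ({placeholders})",
                parameters=row,
            )
```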


2 Answers


Yes, there is no operator that inserts data from GCS into Cloud SQL directly, but you can use the CloudSQLHook to import the GCS file.

Here you can find an example of the body, which is a dict describing the file to import. If your file is too big, you can import it in batches (1k-10k rows), which is much better than a row-by-row loop of INSERT INTO statements.
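
A minimal sketch of how that could look, assuming the google provider's CloudSQLHook and its import_instance method (the connection ID, project, instance, bucket, database and table names below are placeholders):

```python
from airflow.providers.google.cloud.hooks.cloud_sql import CloudSQLHook


def import_csv_from_gcs():
    # The hook wraps the Cloud SQL Admin API; call this function from a
    # PythonOperator (or @task) inside your DAG.
    hook = CloudSQLHook(api_version="v1beta4", gcp_conn_id="google_cloud_default")
    hook.import_instance(
        instance="my-cloudsql-instance",   # placeholder instance name
        project_id="my-project",           # placeholder project id
        body={
            "importContext": {
                "uri": "gs://my-bucket/file.csv",
                "fileType": "CSV",
                "database": "my_database",
                "csvImportOptions": {"table": "my_table"},
            }
        },
    )
```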


You can use CloudSQLImportInstanceOperator for your use case. It imports data from Cloud Storage (a CSV file) into a Cloud SQL instance. You can go to this link for an in-depth explanation of this operator.

import datetime

from airflow import models
from airflow.providers.google.cloud.operators.cloud_sql import CloudSQLImportInstanceOperator

# If you are running Airflow in more than one time zone
# see https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
# for best practices
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

PROJECT_ID = "project-id"
DAG_ID = "cloudsql"
#BUCKET_NAME = f"{DAG_ID}_{ENV_ID}_bucket"
INSTANCE_NAME = "instance-name"

import_body = {
    "importContext": {
        "uri": "gs://bucket/file.csv",
        "fileType": "CSV",
        "csvImportOptions": {
            "table": "table",
            "columns": [
                "column1",
                "column2"
            ]
        },
        "database": "guestbook",
        "importUser": "postgres"
    }
}

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}


with models.DAG(
        'composer_quickstart',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    # Import the CSV file from GCS into the Cloud SQL instance
    sql_import_task = CloudSQLImportInstanceOperator(
        body=import_body,
        instance=INSTANCE_NAME,
        task_id='sql_import_task',
        project_id=PROJECT_ID)

    sql_import_task