
I'm trying to run a basic BigQuery operator task in Airflow (using Google's Cloud Composer) that uses a user-defined function (UDF).

The example comes from https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions and runs perfectly in BigQuery.

However, when I upload the DAG to Composer I get "Function not found: multiplyInputs...". See the Python script below.

The udf_config field of BigQueryOperator requires a list, so I defined my UDF as a list containing one string. I'm not sure this is correct, as the function is clearly not being registered as a UDF.

Any help would be highly appreciated.

import datetime

from airflow import models
from airflow.contrib.operators import bigquery_operator

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting the start date to yesterday starts the DAG immediately
    # when it is detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry, set the 'email' arg to your email
    # and enable emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes.
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'vital-platform-791',
}

with models.DAG('udf_example',
                schedule_interval=datetime.timedelta(days=1),
                default_args=default_dag_args) as dag:

    table = 'udf_table'

    # Flatten the fe table.
    task_id = table + '_fe'

    # The UDF, wrapped in a single-element list since udf_config expects a list.
    udf_config = ["""CREATE TEMPORARY FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
                  RETURNS FLOAT64
                  LANGUAGE js AS \"""
                  return x*y;
                  \""";
                  """]

    print(udf_config)

    query = """WITH numbers AS
              (SELECT 1 AS x, 5 AS y
              UNION ALL
              SELECT 2 AS x, 10 AS y
              UNION ALL
              SELECT 3 AS x, 15 AS y)
            SELECT x, y, multiplyInputs(x, y) AS product
            FROM numbers"""

    print(query)

    destination_table = 'vital-platform-791.alpha_factors.{table}_fe'.format(
        table=table)

    t_fe = bigquery_operator.BigQueryOperator(
        task_id=task_id,
        bql=query,
        destination_dataset_table=destination_table,
        use_legacy_sql=False,
        write_disposition='WRITE_TRUNCATE',
        udf_config=udf_config)

2 Answers


I'm a little confused by the example. It looks like you just need to merge the udf_config and query:

query = ""CREATE TEMPORARY FUNCTION multiplyInputs(x 
              FLOAT64, y FLOAT64)
              RETURNS FLOAT64
              LANGUAGE js AS \"""
              return x*y;
              \""";
              WITH numbers AS
          (SELECT 1 AS x, 5 as y
          UNION ALL
          SELECT 2 AS x, 10 as y
          UNION ALL
          SELECT 3 as x, 15 as y)
        SELECT x, y, multiplyInputs(x, y) as product
        FROM numbers;"""
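
For completeness, a minimal sketch of how the question's operator call would change with the merged query. Everything here is carried over from the question's own variables; udf_config is simply dropped, since standard SQL declares the temporary function inline:

# Sketch: with the CREATE TEMPORARY FUNCTION statement merged into `query`,
# udf_config is no longer passed to the operator.
t_fe = bigquery_operator.BigQueryOperator(
    task_id=task_id,
    bql=query,
    destination_dataset_table=destination_table,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE')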
Elliott Brossard
  • Thanks, this works well. The BigQueryOperator's explicit udf_config field misled me somewhat, as I assumed I had to populate it when using UDFs. – Richardt Benade REZCO May 28 '18 at 07:25
  • I think the question is for standard SQL, hence this answer should be accepted instead of mine. Good work @ElliottBrossard – kaxil May 28 '18 at 12:02

Upload your UDF file to Google Cloud Storage and pass its path to the udf_config param.

For example:

Say your UDF file is at gs://test-bucket/testfolder/udf.js

Then in your Airflow DAG use:

udf_gcs_path = "gs://test-bucket/testfolder/udf.js"

bigquery_operator.BigQueryOperator(
    task_id=task_id,
    bql=query,
    destination_dataset_table=destination_table,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    udf_config=[{"resourceUri": udf_gcs_path}])
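
For reference, UDFs registered this way via resourceUri are legacy SQL JavaScript UDFs, and the .js file must define and register the function itself. A minimal sketch of what udf.js could contain, using the legacy bigquery.defineFunction() registration; the column names and output schema here are illustrative, not from the answer:

// Hypothetical udf.js for the legacy SQL UDF mechanism: the function body
// plus a registration call declaring its inputs and output schema.
function multiplyInputs(row, emit) {
  emit({product: row.x * row.y});
}

bigquery.defineFunction(
  'multiplyInputs',                    // name used in the query
  ['x', 'y'],                          // input column names
  [{name: 'product', type: 'float'}],  // output schema
  multiplyInputs                       // the function defined above
);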


kaxil
  • Thanks. Looking at the documentation, this seems to be the method for legacy SQL, not standard SQL? https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions – Richardt Benade REZCO May 28 '18 at 07:49