
I have manually provisioned a Glue Crawler and am now attempting to run it via Airflow (in AWS).

Based on the docs from here, there seem to be plenty of ways to accomplish this compared to other tasks within the Glue environment. However, I'm having trouble with this seemingly simple scenario.

The following code defines the basic setup for the Glue Crawler + Airflow integration. Assume there are other working tasks defined before and after it that are not included here.

run_crawler = AwsGlueCrawlerHook()
run_crawler.start_crawler(crawler_name="foo-crawler")

Now, here is an example flow:

json2parquet >> run_crawler >> parquet2redshift

Given all this, the following error manifests on the Airflow Webserver UI:

Broken DAG: An error occurred (CrawlerRunningException) when calling the StartCrawler operation: Crawler with name housing-raw-crawler-crawler-b3be889 has already started

I can already hear the response: why not use something other than the start_crawler method? Fair point, but I don't know what else to use. I just want to start the crawler after some upstream tasks have completed successfully, but I'm unable to.

How should I resolve this problem?

nate

2 Answers


json2parquet >> run_crawler >> parquet2redshift

In Airflow, Python's bitwise right-shift operator (>>) is used to define a downstream relationship between two operators (i.e. instances of BaseOperator subclasses).

Declaring a DAG > Task Dependencies (Airflow)
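
For example, a minimal sketch of how >> chains two operators (the DAG and task names here are hypothetical placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(dag_id="example_deps", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extracting"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("loading"))

    # Both operands are BaseOperator instances, so Airflow records that
    # `load` runs only after `extract` has succeeded.
    extract >> load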

run_crawler = AwsGlueCrawlerHook()
run_crawler.start_crawler(crawler_name="foo-crawler")

run_crawler (AwsGlueCrawlerHook) is not an operator; it is a subclass of BaseHook. The >> (and <<) Python operators can only be used with objects that subclass BaseOperator. In addition, because start_crawler is called at the top level of the DAG file, it executes every time the scheduler parses the file, which is why the CrawlerRunningException surfaces as a Broken DAG error.

airflow.hooks.base
airflow.models.baseoperator

How should I resolve this problem?

run_crawler needs to be implemented as an operator (i.e. a subclass of BaseOperator).

PythonOperator is one type of operator. However, the GlueCrawlerOperator is more feature-rich with respect to creating, updating, and running a Glue crawler, and it executes idempotently: if a crawler with the same name already exists, the operator runs it; otherwise, it creates it.

GlueCrawlerOperator (Airflow)
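
For example, a minimal sketch of what that could look like for the crawler in the question (assuming the apache-airflow-providers-amazon package is installed; the role ARN, database name, S3 path, and DAG reference are hypothetical placeholders):

from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

run_crawler = GlueCrawlerOperator(
    task_id="run_crawler",
    config={
        # The config keys mirror the Glue CreateCrawler API; values are placeholders.
        "Name": "foo-crawler",
        "Role": "arn:aws:iam::123456789012:role/my-glue-crawler-role",
        "DatabaseName": "housing_db",
        "Targets": {"S3Targets": [{"Path": "s3://my-bucket/parquet/"}]},
    },
    dag=housing_dag,
)

# The operator instance can then be chained like any other task:
json2parquet >> run_crawler >> parquet2redshift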

Andrew Nguonly

This issue came up because of my lack of Airflow knowledge. Encapsulating the functionality above in a PythonOperator solved it. A workable approach looks something like this:

# Import paths assume Airflow 2.x with the Amazon provider package installed.
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.glue_crawler import AwsGlueCrawlerHook


def glue_crawler_parquet2redshift():
    # The hook is instantiated and the crawler started inside the callable,
    # so the API call only happens when the task runs, not at DAG parse time.
    run_crawler = AwsGlueCrawlerHook()
    return run_crawler.start_crawler(crawler_name="housingGlueCrawlerParquetRaw")


glue_crawler_parquet2redshift_task = PythonOperator(
    task_id='ingestHousingRawParquet',
    python_callable=glue_crawler_parquet2redshift,
    dag=housing_dag,
)
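
With the task defined this way, the flow from the question can be declared between operators, assuming json2parquet and parquet2redshift are tasks in the same DAG:

json2parquet >> glue_crawler_parquet2redshift_task >> parquet2redshift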
nate