I have manually provisioned a Glue Crawler and am now attempting to run it via Airflow (in AWS).
Based on the docs from here, there seem to be several ways to accomplish this compared to other tasks within the Glue environment. However, I'm running into issues with this seemingly simple scenario.
The following code defines the basic Glue Crawler + Airflow setup. Assume there are other working tasks defined before and after it that are not shown here.
run_crawler = AwsGlueCrawlerHook()
run_crawler.start_crawler(crawler_name="foo-crawler")
Now, here is an example flow:
json2parquet >> run_crawler >> parquet2redshift
Given all this, the following error appears in the Airflow webserver UI:
Broken DAG: An error occurred (CrawlerRunningException) when calling the StartCrawler operation: Crawler with name housing-raw-crawler-crawler-b3be889 has already started
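Notably, the error shows up as a "Broken DAG" banner, which suggests the hook call is running while the scheduler parses the DAG file rather than when the task executes. A toy, Airflow-free stand-in of that distinction (all names here are placeholders, not real Airflow APIs):

```python
# Toy illustration (no Airflow required): module-level code runs at
# import/parse time, while code inside a callable runs only when invoked.

calls = []

def start_crawler(name):
    # Stand-in for AwsGlueCrawlerHook().start_crawler(...)
    calls.append(name)

# "Top level of the DAG file": executes as soon as the module is parsed.
start_crawler("foo-crawler")

def run_crawler_task():
    # Wrapped version: executes only when something actually calls it.
    start_crawler("foo-crawler")

# The crawler has been "started" once just by parsing this file,
# even though run_crawler_task() was never called.
assert calls == ["foo-crawler"]
```

This mirrors my situation: the crawler appears to start on every DAG parse, which would explain the CrawlerRunningException on subsequent parses.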
I can already hear the suggestion: why not use something other than the start_crawler method? Fair point, but I don't know what else to use. I just want to start the crawler after some upstream tasks have completed successfully, but I can't get that to work.
How should I resolve this problem?