I have an AWS Glue crawler that looks at a specific S3 location containing Avro files. I have a process which outputs files into a new subfolder of that location.

Once I manually run the crawler, the new subfolder shows up as a new table in a database, and it is also queryable from Athena.

Is there a way I can automate the process and call the crawler programmatically, specifying only that new subfolder, so that it doesn't have to scan the entire parent folder structure? I want to add tables to a database, not partitions to an existing table.

I was looking for a Python option, and I have seen that one can do:

import boto3
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='avro-crawler')

I haven't seen an option to pass a folder to limit where the crawler looks. Because there are hundreds of folders/tables in that location, re-crawling everything takes a long time, which I'm trying to avoid.

What are my options here? Would I need to programmatically create a new crawler for each new subfolder added to S3?

Or create a Lambda function which gets triggered when a new subfolder is added to S3? I've seen an answer here, but even with Lambda, it still implies that I call start_crawler, which would crawl everything?

Thanks for any suggestions.

cristi.calugaru

1 Answer

Set crawler_name to the name of your crawler and update_path to the S3 path of the new subfolder, then update the crawler's targets:

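# glue_client is the boto3 Glue client from the question; this replaces the crawler's S3 targets.
response = glue_client.update_crawler(
    Name=crawler_name,
    Targets={'S3Targets': [{'Path': update_path}]})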
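
Putting the update and the run together, a minimal sketch might look like the following. The helper name crawl_subfolder is hypothetical, and the function takes an already-created boto3 Glue client as an argument; only update_crawler and start_crawler are actual Glue API calls.

```python
def crawl_subfolder(glue_client, crawler_name, s3_path):
    """Point an existing crawler at a single S3 path, then start it.

    glue_client is assumed to be a boto3 Glue client, e.g.
    boto3.client('glue', region_name='us-east-1').
    """
    # update_crawler replaces the crawler's whole target list with this one path.
    glue_client.update_crawler(
        Name=crawler_name,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The next run scans only s3_path instead of the whole parent folder.
    glue_client.start_crawler(Name=crawler_name)
```

For example, crawl_subfolder(glue_client, 'avro-crawler', 's3://my-bucket/avro/new-subfolder/'), where the bucket and prefix are placeholders. Note that update_crawler is rejected while the crawler is still running, so this should only be called when the crawler is idle.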
sam
Kishore
    Thanks, that's exactly the approach I ended up taking. As a reference for others: you can reuse an existing crawler, you just need to update its S3 target paths. Once that's done, start_crawler will use the updated S3 paths. – cristi.calugaru Aug 21 '18 at 16:47
    But will it return everything back to normal after it runs? – pavel_orekhov May 30 '19 at 15:05
    @hey_you No. It's a permanent change until the crawler is updated again. – Asclepius Apr 14 '21 at 23:04
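
Since the change is permanent, one way to get the old configuration back is to save the crawler's targets before changing them and write them back once the run has finished. This is only a sketch: the helper name crawl_then_restore and the poll_seconds parameter are hypothetical, and it assumes that get_crawler reports a state other than READY as soon as start_crawler returns, and that the Targets structure returned by get_crawler can be passed back to update_crawler unchanged.

```python
import time


def crawl_then_restore(glue_client, crawler_name, s3_path, poll_seconds=15):
    """Crawl one S3 path, then put the crawler's original targets back.

    glue_client is assumed to be a boto3 Glue client.
    """
    # Remember the current targets so they can be restored afterwards.
    original_targets = glue_client.get_crawler(Name=crawler_name)["Crawler"]["Targets"]

    # Temporarily point the crawler at the new subfolder only, then run it.
    glue_client.update_crawler(
        Name=crawler_name,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=crawler_name)

    # A crawler cannot be updated while it is running, so wait until it is READY.
    while glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Undo the temporary change.
    glue_client.update_crawler(Name=crawler_name, Targets=original_targets)
    return original_targets
```

The polling loop blocks for the duration of the crawl, which may not suit a short-lived caller such as a Lambda function; in that case the restore step would have to happen elsewhere.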