Is there a way to run aws glue crawler after job is finished?

Question

For example, I run an ETL job and new fields or columns may be added to the target table. To detect the table changes a crawler has to be run, but a crawler can only be started manually or on a schedule.

Can a crawler be triggered after a job has finished?

score 11 · Accepted Answer · answered Jan 13 '18 at 19:04

import boto3
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='name_of_crawler')

Add this snippet at the end of your Glue job script, with region_name and Name set to your own region and crawler.
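If the job should also wait for the crawl to finish, or tolerate a crawler that is already running, the call above can be wrapped roughly as follows. This is only a sketch: the helper name `start_crawler_and_wait` and its polling parameters are made up for illustration, and it assumes the crawler has left the READY state by the time `start_crawler` returns. The `start_crawler`, `get_crawler` and `CrawlerRunningException` names are the boto3 Glue client's own.

```python
import time


def start_crawler_and_wait(glue_client, crawler_name,
                           poll_seconds=15, timeout_seconds=900):
    """Start a Glue crawler and block until it is back in the READY state.

    glue_client is expected to be boto3.client('glue', ...).
    The helper name and the polling defaults are illustrative assumptions.
    """
    try:
        glue_client.start_crawler(Name=crawler_name)
    except glue_client.exceptions.CrawlerRunningException:
        # The crawler is already running; wait for that run instead.
        pass

    deadline = time.monotonic() + timeout_seconds
    while True:
        # get_crawler reports State as READY, RUNNING or STOPPING.
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Crawler {crawler_name} is still {state} "
                f"after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Usage at the end of the job script (needs boto3 and AWS credentials):
#   import boto3
#   glue_client = boto3.client('glue', region_name='us-east-1')
#   start_crawler_and_wait(glue_client, 'name_of_crawler')
```

Passing the client in as an argument keeps the helper independent of how the job creates its boto3 session.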

Ashutosh


This is throwing connection time out error. Is there any alternative or solution for my error, please? ConnectTimeoutError: Connect timeout on endpoint URL: "https://glue.eu-central-1.amazonaws.com/" – Tula Aug 03 '20 at 06:30
This method works for me. I have a Glue job that converts a daily partition from TSV to Parquet and saves it to S3 with a _SUCCESS marker file. After the Parquet files are saved to S3, the same Glue job runs the code above to start the crawler and update the table in the catalog. – panc Jun 18 '23 at 00:28

Neil McGuigan · Answer 2 · 2021-04-01T03:23:26.240

You can, using a conditional trigger, but not through the trigger UI :S

With a Glue Workflow: add a trigger that starts the job, add the job, add a trigger that fires on job success, and add the crawler as the action of that second trigger.

Or, using the CLI:

aws glue create-trigger --name myJob-success \
    --type CONDITIONAL \
    --predicate '{"Logical":"ANY","Conditions":[{"JobName":"myJob","LogicalOperator":"EQUALS","State":"SUCCEEDED"}]}' \
    --actions CrawlerName=myCrawler \
    --start-on-creation

or in CloudFormation:

Type: AWS::Glue::Trigger
Properties: 
  Name: job_success
  Type: CONDITIONAL
  Predicate: 
    Logical: ANY
    Conditions:
      - JobName: myJob
        LogicalOperator: EQUALS
        State: SUCCEEDED
  Actions: 
    - CrawlerName: myCrawler
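
The same conditional trigger can probably be created from Python as well. The sketch below mirrors the CLI call using boto3's `create_trigger`; the two function names are invented for illustration, and `myJob`, `myCrawler` and `myJob-success` are the placeholder names from the examples above.

```python
def build_crawler_trigger(job_name, crawler_name, trigger_name):
    """Build keyword arguments for the boto3 Glue client's create_trigger.

    The trigger fires the crawler once the job reaches SUCCEEDED.
    This helper's name is an illustrative assumption.
    """
    return {
        "Name": trigger_name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "ANY",
            "Conditions": [
                {
                    "JobName": job_name,
                    "LogicalOperator": "EQUALS",
                    "State": "SUCCEEDED",
                }
            ],
        },
        "Actions": [{"CrawlerName": crawler_name}],
        # Same effect as --start-on-creation in the CLI call.
        "StartOnCreation": True,
    }


def create_crawler_trigger(glue_client, job_name, crawler_name, trigger_name):
    """Create the trigger; glue_client is expected to be boto3.client('glue')."""
    return glue_client.create_trigger(
        **build_crawler_trigger(job_name, crawler_name, trigger_name)
    )


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   create_crawler_trigger(boto3.client('glue'), 'myJob', 'myCrawler', 'myJob-success')
```

Building the arguments in a separate function makes it easy to inspect the predicate before anything is sent to AWS.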

score 0 · Answer 3 · answered Aug 08 '23 at 17:15

If you want to update the Glue Data Catalog table, you can use the code below in the job's write step so that the table is updated while the results are written.

    val dataSink = glueContext
      .getSink(
        connectionType = "s3",
        connectionOptions = JsonOptions(
          Map(
            "path" -> outputPath,
            "enableUpdateCatalog" -> true, // this value should be added
            "updateBehavior" -> "UPDATE_IN_DATABASE" // this value should be added
          )
        )
      )
      .withFormat(
        format = "parquet",
        options = JsonOptions(Map("useGlueParquetWriter" -> true)) // this value should be added
      )

    // Tell the sink which catalog table to create or update
    dataSink.setCatalogInfo(catalogDatabase = databaseName, catalogTableName = tableName)

    // Write the data; the catalog table is updated as part of the write
    dataSink.writeDynamicFrame(frame = DynamicFrame(dataframe, glueContext))
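
For jobs written in Python rather than Scala, a rough equivalent might look like the sketch below. It assumes `glue_context` is an awsglue `GlueContext` and `dynamic_frame` is a `DynamicFrame`; the function name `write_and_update_catalog` is made up for illustration, and `glueparquet` is used as the Python-side format name for the Glue Parquet writer.

```python
def write_and_update_catalog(glue_context, dynamic_frame, output_path,
                             database_name, table_name):
    """Write a DynamicFrame to S3 and update the Data Catalog table in one step.

    glue_context is assumed to be an awsglue.context.GlueContext.
    The function name is an illustrative assumption.
    """
    sink = glue_context.getSink(
        connection_type="s3",
        path=output_path,
        enableUpdateCatalog=True,             # update the catalog during the write
        updateBehavior="UPDATE_IN_DATABASE",  # apply schema changes to the table
    )
    # Use the Glue Parquet writer, as in the Scala snippet.
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase=database_name,
                        catalogTableName=table_name)
    sink.writeFrame(dynamic_frame)
    return sink
```

With this approach no crawler run is needed for schema changes made by the job itself.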
