11

For example, I run an ETL job, and new fields or columns may be added to the target table. To detect the table changes a crawler has to be run, but it can only be started manually or on a schedule.

Can a crawler be triggered after the job has finished?

Cherry
  • 31,309
  • 66
  • 224
  • 364

3 Answers

11
import boto3
glue_client = boto3.client('glue', region_name='us-east-1')
glue_client.start_crawler(Name='name_of_crawler')

Add this snippet at the end of your job script, replacing the region and crawler name with your own.
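
A slightly more defensive variant of the snippet above is sketched below. It assumes the crawler may still be busy from an earlier run, in which case `start_crawler` would be rejected, so it checks the state with `get_crawler` first. The helper name `start_crawler_if_ready` is hypothetical; `get_crawler` and `start_crawler` are the boto3 Glue client calls.

```python
def start_crawler_if_ready(glue_client, crawler_name):
    """Start the crawler only if it is idle; return True when a run was started."""
    # get_crawler reports the crawler state as READY, RUNNING or STOPPING.
    state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state != "READY":
        # A crawler that is still running or stopping cannot be started again.
        return False
    glue_client.start_crawler(Name=crawler_name)
    return True
```

At the end of the job it could be called as `start_crawler_if_ready(boto3.client('glue', region_name='us-east-1'), 'name_of_crawler')`. There is still a small window between the check and the start, so this reduces the chance of a rejected call rather than removing it.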

Ashutosh
  • 347
  • 4
  • 11
  • This is throwing connection time out error. Is there any alternative or solution for my error, please? ConnectTimeoutError: Connect timeout on endpoint URL: "https://glue.eu-central-1.amazonaws.com/" – Tula Aug 03 '20 at 06:30
  • This method works for me. I have a Glue job to convert daily partition from tsv to parquet and save to s3 with _SUCCESS marker file. After parquet saved in S3, in the same Glue job I then run the above code to run crawler to update the table in the catalog. – panc Jun 18 '23 at 00:28
0

You can, using a trigger, but not in the trigger UI :S

With a Glue Workflow: add a trigger to start the job, add the job, add a trigger that fires on job success, and add the crawler as the action of that trigger.


Or, using the CLI:

aws glue create-trigger --name myJob-success \
    --type CONDITIONAL \
    --predicate '{"Logical":"ANY","Conditions":[{"JobName":"myJob","LogicalOperator":"EQUALS","State":"SUCCEEDED"}]}' \
    --actions CrawlerName=myCrawler \
    --start-on-creation

or in CloudFormation:

Type: AWS::Glue::Trigger
Properties: 
  Name: job_success
  Type: CONDITIONAL
  Predicate: 
    Logical: ANY
    Conditions:
      - JobName: myJob
        LogicalOperator: EQUALS
        State: SUCCEEDED
  Actions: 
    - CrawlerName: myCrawler
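
The same conditional trigger could also be created from Python. The sketch below mirrors the CLI call above using the boto3 Glue client's `create_trigger`; the two helper function names are hypothetical, and the job, crawler and trigger names are whatever you pass in.

```python
def job_success_trigger(trigger_name, job_name, crawler_name):
    """Build create_trigger arguments: run the crawler when the job succeeds."""
    return {
        "Name": trigger_name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "ANY",
            "Conditions": [
                {"JobName": job_name, "LogicalOperator": "EQUALS", "State": "SUCCEEDED"}
            ],
        },
        "Actions": [{"CrawlerName": crawler_name}],
        # Activate the trigger immediately, like --start-on-creation in the CLI.
        "StartOnCreation": True,
    }


def create_job_success_trigger(glue_client, trigger_name, job_name, crawler_name):
    """Create the trigger through a boto3 Glue client and return the response."""
    return glue_client.create_trigger(
        **job_success_trigger(trigger_name, job_name, crawler_name)
    )
```

For the names used above this would be `create_job_success_trigger(boto3.client('glue'), 'myJob-success', 'myJob', 'myCrawler')`.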
Neil McGuigan
  • 46,580
  • 12
  • 123
  • 152
0

If you want to update the Glue Data Catalog table, you can use the code below in the job's write step, so that the table is updated while the results are written.

    val dataSink = glueContext
      .getSink(
        connectionType = "s3",
        connectionOptions = JsonOptions(
          Map(
            "path" -> outputPath,
            "enableUpdateCatalog" -> true, // this value should be added
            "updateBehavior" -> "UPDATE_IN_DATABASE" // this value should be added
          )
        )
      )
      .withFormat(
        format = "parquet",
        options = JsonOptions(Map("useGlueParquetWriter" -> true)) // this value should be added
      )


dataSink.setCatalogInfo(catalogDatabase = databaseName, catalogTableName = tableName)

dataSink.writeDynamicFrame(frame = DynamicFrame(dataframe, glueContext))
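
For a PySpark Glue job, a rough equivalent of the Scala code above might look like the sketch below. It assumes a `GlueContext` and a `DynamicFrame` already exist in the job; the function name and its parameters are hypothetical, while `getSink`, `setFormat`, `setCatalogInfo` and `writeFrame` are the calls I understand the Glue PySpark API to provide for this.

```python
def write_and_update_catalog(glue_context, dynamic_frame, output_path,
                             database_name, table_name):
    """Write the frame to S3 as Parquet and update the catalog table in the same step."""
    sink = glue_context.getSink(
        connection_type="s3",
        path=output_path,
        enableUpdateCatalog=True,             # let the write update the catalog
        updateBehavior="UPDATE_IN_DATABASE",  # apply schema changes to the table
    )
    # Same options as the Scala version: Parquet output via the Glue Parquet writer.
    sink.setFormat("parquet", useGlueParquetWriter=True)
    sink.setCatalogInfo(catalogDatabase=database_name, catalogTableName=table_name)
    sink.writeFrame(dynamic_frame)
```

With this in place no separate crawler run should be needed for schema changes made by the job itself.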
Amjad Tubasi
  • 384
  • 5
  • 7