I'm learning AWS Glue with PySpark by following this page: https://aws-dojo.com/ws8/labs/configure-crawler/.

My question is: are a crawler and a database in Lake Formation required in order to create a Glue job?

I have an issue with my AWS role: I'm not authorised to create resources in Lake Formation. So I'm wondering whether I can skip those steps and only create a Glue job to test my script.

For example, if I just want to test my PySpark script against a single input .txt file stored in S3, do I still need a crawler? Can I simply use boto3 to create a Glue job that runs the script, does some preprocessing, and writes the data back to S3?
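For reference, this is roughly what I have in mind with boto3. The job name, IAM role ARN, bucket, and script path below are all placeholders, not real resources:

```python
# Sketch of creating and running a Glue job with boto3, without any
# crawler or Lake Formation setup. All names/ARNs are placeholders.

def build_glue_job_args(job_name, role_arn, script_s3_path):
    """Build the keyword arguments for glue_client.create_job()."""
    return {
        "Name": job_name,
        "Role": role_arn,  # IAM role that can read/write the S3 bucket
        "Command": {
            "Name": "glueetl",                # Spark ETL job type
            "ScriptLocation": script_s3_path, # PySpark script uploaded to S3
            "PythonVersion": "3",
        },
        "GlueVersion": "3.0",
        "NumberOfWorkers": 2,
        "WorkerType": "G.1X",
    }

if __name__ == "__main__":
    import boto3  # needs AWS credentials configured locally

    glue = boto3.client("glue")
    args = build_glue_job_args(
        job_name="my-test-job",
        role_arn="arn:aws:iam::123456789012:role/MyGlueRole",   # placeholder
        script_s3_path="s3://my-bucket/scripts/preprocess.py",  # placeholder
    )
    glue.create_job(**args)
    run = glue.start_job_run(JobName="my-test-job")
    print(run["JobRunId"])
```

My understanding is that inside the script itself I could read the file directly, e.g. `spark.read.text("s3://my-bucket/input/data.txt")`, instead of going through the Data Catalog, which is why I think the crawler might be unnecessary for this test.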