2

We have an existing setup where AWS Glue crawlers crawl S3 directories. These directories are created as part of our AWS data lake and are populated by a Spark job. To implement a delta feature, we ran a POC on Delta Lake. However, when I write Delta Lake files to S3 through our Spark-Delta jobs, the crawlers are not able to create tables from these files.

Can we crawl delta lake files using AWS crawlers ?

user3199285
  • I believe Delta Lake files are nothing but Parquet files. Can you verify that the crawler's IAM role has read permissions on these files? Also, when writing to S3, make sure you grant bucket-owner full control if Delta Lake doesn't own the bucket: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html – Prabhakar Reddy Sep 05 '20 at 17:08
  • Yep, that is verified. I also understand that these S3 files are Parquet. But when I tried to crawl them, the tables were created without a table name. Then I checked these tables in Athena, and there were 0 rows for the Delta files; hence the issue. You can try this yourself. – user3199285 Sep 05 '20 at 17:21
  • What does the SHOW CREATE TABLE output contain? Is the location pointing to a folder or a file? If it is a file, you need to crawl these Parquet files by keeping them in separate folders and then pass the parent path to the crawler, which will create tables with distinct schemas and locations pointing to folders instead of files. – Prabhakar Reddy Sep 05 '20 at 18:37

2 Answers

2

As per this doc, you should not use a Glue crawler. You should instead use manifest files to integrate Delta files with Athena.

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
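A minimal sketch of the manifest approach (bucket path, table name, and columns are placeholders, not from the question): first generate a symlink manifest from Spark with Delta Lake installed, then define an external table in Athena that reads the manifest instead of the raw Parquet files.

```sql
-- In Spark SQL (Delta Lake installed): write a symlink manifest under the table path.
GENERATE symlink_format_manifest FOR TABLE delta.`s3://my-bucket/delta/events`;

-- In Athena: point LOCATION at the manifest directory, not at the data files,
-- so only the files in the current Delta snapshot are read.
CREATE EXTERNAL TABLE events (
  id string,
  event_time timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/delta/events/_symlink_format_manifest/';
```

Note that the manifest is a point-in-time snapshot; regenerate it (or enable automatic manifest updates on the Delta table) after each write so Athena sees new data.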

Prabhakar Reddy
0

Glue Crawler released a Delta Lake integration in 2022: the crawler parses the Delta transaction log to gather the latest snapshot of the Delta table. It then creates manifest files and registers an entry in the Glue Data Catalog that is queryable via Athena or Redshift Spectrum. Tables created by the Delta Lake crawler are also compatible with Lake Formation cell-level security.

When creating a Delta Lake crawler, make sure you specify a Delta target in the console rather than an S3 target. The crawler can be scheduled, automatically detects schema evolution in your Delta Lake tables, populates the Glue Data Catalog, and updates any new partitions it discovers.
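As a rough sketch, the same Delta target can be configured outside the console through the Glue CreateCrawler API. The crawler name, role ARN, database, and table path below are placeholders, and the field names follow the `DeltaTargets` shape of the API as I understand it; verify against the current API reference before use.

```json
{
  "Name": "my-delta-crawler",
  "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
  "DatabaseName": "delta_db",
  "Targets": {
    "DeltaTargets": [
      {
        "DeltaTables": ["s3://my-bucket/delta/events/"],
        "WriteManifest": true
      }
    ]
  }
}
```

Passing this JSON to `aws glue create-crawler --cli-input-json file://crawler.json` should create a crawler that treats the path as a Delta table rather than a plain folder of Parquet files.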