I have a process that stores data to S3, transforms the data and converts the data to Parquet, to be queried through Redshift Spectrum. I have a Glue crawler that crawls my dataset, and I use three partitions: year, month, day. All my files are stored like this:

<bucket>/<folder>/<folder>/<folder>/year=2018/month=8/day=20

I have data from 2015 up to the most recent day, which gives me just over 1300 partition keys.

Here is the problem. A couple of days ago I started seeing this message from the crawler:

INFO : Folder partition keys do not match table partition keys, skipped folder: <bucket>/<folder>/<folder>/<folder>/year=2018/month=8/

The consequence is that queries against August 2018 return no data, which of course is very unfortunate.

Since all my data is stored in the same structure as part of the same ETL process, and nothing in the process up to the crawl fails, I'm very puzzled as to why the crawler has suddenly started skipping the latest month (month=8). I have checked repeatedly for any difference between the table partitions and the folder partitions for month=8, but I can't find anything.
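For reference, here is a minimal sketch of the kind of check I mean. It takes a list of S3 object keys and flags every key whose folder partition keys are not exactly year, month, day. The sample keys and the function names are made up for illustration; in practice the keys would come from an S3 listing of the crawled prefix. It only inspects folder names, so it would not reveal schema differences inside the Parquet files.

```python
import re

# Assumed table partition keys, taken from the folder layout described above.
EXPECTED_KEYS = ("year", "month", "day")

# Matches a Hive-style partition folder such as "month=8".
_PARTITION_RE = re.compile(r"^([^=/]+)=([^/]*)$")


def partition_keys(object_key):
    """Return the partition key names found in the folder part of an S3 object key."""
    folders = object_key.split("/")[:-1]  # drop the file name (or empty trailing part)
    keys = []
    for folder in folders:
        match = _PARTITION_RE.match(folder)
        if match:
            keys.append(match.group(1))
    return tuple(keys)


def find_mismatches(object_keys, expected=EXPECTED_KEYS):
    """Return the object keys whose folder partition keys differ from the expected ones."""
    return [key for key in object_keys if partition_keys(key) != expected]


# Hypothetical object keys, for illustration only.
sample = [
    "data/year=2018/month=8/day=20/part-0000.parquet",
    "data/year=2018/month=8/stray-file.parquet",
    "data/year=2018/month=8/day=21/part-0000.parquet",
]
mismatches = find_mismatches(sample)
print(mismatches)
```

An object sitting directly under a month folder, or an empty folder-marker key ending in "/", would show up in the result because its partition keys stop at month.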

This is a long shot, but does anyone have any input on why this might occur?

marc_s
Jørgen Frøland
  • How is the source location configured in the crawler? Does it point to ////? Did you try deleting the existing table and letting the crawler recreate it? Or just configure the crawler to create tables in another db so that you can compare both schemas. – Yuriy Bondaruk Aug 23 '18 at 01:03
  • The crawler points to ///. I have tried deleting the table and recrawling. Same behavior. Since yesterday, I have discovered that the last three days of data are causing the problem. If I exclude all data before these three days, it works as expected. And if I crawl the last three days in isolation, it works as expected. I then created two crawlers, one that crawls the data before the problematic date, and another that crawls the data after. It doesn't solve the root problem, but at least I don't lose data now. And I will start comparing the schemas. – Jørgen Frøland Aug 23 '18 at 07:10
  • I have also seen this video: https://www.youtube.com/watch?v=GObs0r6yOPo in which Prajakta Damle, Senior Product Manager for AWS Glue, explains how schemas are merged based on similarity. If the similarity is 70% or higher, it merges the schemas. What puzzles me here is that the crawler skips the folder altogether. I had hoped it would at least create a new table with another schema. – Jørgen Frøland Aug 23 '18 at 07:11
  • There's a comment at the end of one of the pages about crawlers saying that a crawler scans the first 2 MB of each file and determines the schema based on that. It might be that this somehow also affects partition recovery. – LauriK Aug 23 '18 at 07:21
  • Did you try enabling the 'Create a single schema for each S3 path.' option in your crawler when you configured it with multiple paths (before and after the 3 days)? – Yuriy Bondaruk Aug 23 '18 at 11:28
  • Yes, I have tried that. But then I get duplicate columns, which gives me errors when running my SQL query. And if I remove the duplicates, I get another error message when querying the data. I don't remember the last error message, though. – Jørgen Frøland Aug 23 '18 at 12:35
  • Hi, did you find a solution to this issue or determine a cause? I have noticed that a crawler is skipping some partitions. We have bucket/folder/year=####/month=##/day=##. The crawler is pointed to bucket/folder, and it returns partitions for many days and then suddenly skips a day... – TechNewbie Aug 03 '21 at 20:42
  • @TechNewbie Hey. No, I haven't figured out a solution. It's been a while now, and I have moved on to other things. Sorry. – Jørgen Frøland Nov 28 '21 at 20:20
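
For anyone who wants to do the schema comparison suggested in the comments, a rough sketch could look like the following. It assumes each table's columns are available as a list of dicts with "Name" and "Type" entries, which as far as I know is the shape the Glue catalog returns under Table.StorageDescriptor.Columns. The function name and the sample columns are hypothetical, and this is not something confirmed to explain the skipped folders.

```python
def diff_columns(cols_a, cols_b):
    """Compare two column lists of {"Name": ..., "Type": ...} dicts."""
    a = {col["Name"]: col["Type"] for col in cols_a}
    b = {col["Name"]: col["Type"] for col in cols_b}
    shared = sorted(a.keys() & b.keys())
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # Columns present in both tables but with a different type.
        "type_changed": {n: (a[n], b[n]) for n in shared if a[n] != b[n]},
    }


# Hypothetical column lists for the "before" and "after" tables.
before = [
    {"Name": "id", "Type": "bigint"},
    {"Name": "amount", "Type": "double"},
]
after = [
    {"Name": "id", "Type": "string"},
    {"Name": "amount", "Type": "double"},
    {"Name": "source", "Type": "string"},
]
result = diff_columns(before, after)
print(result)
```

Any entry under type_changed or only_in_b would be a place to start looking at what changed in the ETL output around the problematic date.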

0 Answers