I have a process that stores data in S3, transforms it, and converts it to Parquet so that it can be queried through Redshift Spectrum. A Glue crawler crawls the dataset, and the table uses three partition keys: year, month, and day. All my files are stored like this:
<bucket>/<folder>/<folder>/<folder>/year=2018/month=8/day=20
I have data from 2015 up until yesterday, which gives me just over 1300 partitions.
Here is the problem. A couple of days ago I started seeing this message from the crawler:
INFO : Folder partition keys do not match table partition keys, skipped folder: <bucket>/<folder>/<folder>/<folder>/year=2018/month=8/
As a consequence, queries for August 2018 return no data, which is of course very unfortunate.
Since all my data is stored in the same structure by the same ETL process, and nothing in the process fails before the crawl, I'm very puzzled as to why the crawler has suddenly started skipping the latest month (month=8). I have checked repeatedly for any difference between the table partitions and the folder partitions for month=8, but I can't find anything.
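A check of this kind could be sketched roughly as follows. This is only an illustration, not what the crawler itself does: it assumes Hive-style `key=value` folder names, takes a plain list of S3 object keys as input, and flags every object whose folder partition keys are not exactly year, month, day. The function names and the sample keys are hypothetical.

```python
import re

# Partition keys as declared on the table (taken from the description above).
TABLE_PARTITION_KEYS = ["year", "month", "day"]

# Matches one Hive-style folder segment such as "month=8".
_PARTITION_SEGMENT = re.compile(r"^([^=/]+)=([^/]*)$")


def partition_keys_of(object_key):
    """Return the partition key names found in the folder part of an S3 object key."""
    # Drop the last element: the file name, or "" when the key ends with "/".
    folders = object_key.split("/")[:-1]
    keys = []
    for segment in folders:
        match = _PARTITION_SEGMENT.match(segment)
        if match:
            keys.append(match.group(1))
    return keys


def find_mismatched_objects(object_keys, table_keys=TABLE_PARTITION_KEYS):
    """Return the object keys whose folder partition keys differ from the table's."""
    expected = list(table_keys)
    return [key for key in object_keys if partition_keys_of(key) != expected]


if __name__ == "__main__":
    # Hypothetical sample keys; real ones would come from listing the bucket.
    sample_keys = [
        "folder/year=2018/month=8/day=20/part-0000.parquet",
        "folder/year=2018/month=8/",               # zero-byte "folder" object
        "folder/year=2018/month=8/stray-file.txt",  # object stored at month level
    ]
    for key in find_mismatched_objects(sample_keys):
        print("partition keys do not match table:", key)
```

The list of keys could be collected with, for example, the boto3 `list_objects_v2` paginator under the month=8 prefix. Anything this flags, such as an object sitting directly under month=8 or a key name with different casing, would be a candidate to compare against the table definition.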
This is a long shot, but does anyone have any idea why this might occur?