1

I've got some rollup data that I create every night and store in an S3 bucket, partitioned by date. I execute an ALTER TABLE foo ADD PARTITION... to add each new partition to Athena as it's created. I've been able to verify that this successfully adds the data and that I can query it in Athena. So far, so good.

What I'd like to do now is have data older than 30 days automatically roll off. I can set up a lifecycle rule to drop the old data from S3; will this also automatically remove it from Athena, or do I need to take direct action in Athena itself to remove it from the table as well?

It seems to work as I expect when I simply remove the partition in S3, but I can't find anything definitive that says this is the recommended way to handle this.

Joe McMahon
  • 111
  • 5

3 Answers3

0

IIRC we use Glue Crawler to rescan the S3 and recreate the Athena tables every night. Not 100% sure if it's needed, maybe not. It doesn't hurt though :)

MLu
  • 24,849
  • 5
  • 59
  • 86
0

After some experimentation, Athena will not drop references to objects deleted by S3 operations, or at least not immediately -- it's possible that "eventual consistency" will fix the problem at some point, but if you're expecting it to happen in the short term you need to do it yourself. You must use ALTER TABLE to DROP the partitions if you really want them to go away.

I verified this by uploading a file multiple times under different names and deleting all but one. A COUNT(*) query showed that the records were still visible to Athena within a few minutes of the deletion, but a DROP PARTITION / ADD PARTITION operation cleared them immediately.

I would therefore not count on regular S3 lifecycle management to take care of Athena as well.

Joe McMahon
  • 111
  • 5
0

Check out this page. It shows you how to filter and delete partitions in batches using BOTO3 or the CLI.

As mentioned dropping files from s3 does not remove them from Glue Catalogue Metadata. As well removing them from glue does not remove them from S3. Both operations must be done independently.

Tinman
  • 1
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/526239) – Dave M Jul 29 '22 at 17:50