1

My s3 bucket has data partitioned by key. So my data will be at

s3://my-bucket/202001/tablenm/key=1/<data>
s3://my-bucket/202001/tablenm/key=2/<data>
s3://my-bucket/202001/tablenm/key=3/<data>
s3://my-bucket/202001/tablenm/key=4/<data>

I have a external athena table whose location is s3://my-bucket/202001/tablenm/

Now, for the next run, the data is loaded onto the same bucket but in a different month folder. So it would look something like this

s3://my-bucket/202002/tablenm/key=2/<data>
s3://my-bucket/202002/tablenm/key=3/<data>
s3://my-bucket/202002/tablenm/key=4/<data>
s3://my-bucket/202002/tablenm/key=5/<data>

I now want the table to show the data thats in s3://my-bucket/202002/tablenm/ folder. So, I then changed the location for the athena table as

alter table tablenm set location "s3://my-bucket/202002/tablenm/"

and after i did a msck repair on the table i get the error saying partition not in metastore : key=1, key=2, key=3, key=4

Apart from dropping and re-creating the athena table with the new location, is there a way to update the metadata to point to the new location partitions?

Riyan Mohammed
  • 247
  • 2
  • 6
  • 20
  • Typically in an Athena -- S3 arrangement you have glue acting as your service to manage your tables etc, have you checked that a crawler has run to update your glue tables and discover your new data? – pkarfs Apr 24 '20 at 03:21
  • Does this answer your question? [Does the partion location change automatically when the athena table's location is changed?](https://stackoverflow.com/questions/55181583/does-the-partion-location-change-automatically-when-the-athena-tables-location) – Theo Apr 27 '20 at 10:20

1 Answers1

4

Tables are just metadata, and are cheap resources. Don't worry about dropping and recreating them. In this situation it would be better to just drop the table and create a new one.

If you really want to avoid dropping the table, what you need to do is to remove all the partitions before or just after you change the table location.

The location of a table and the locations of its partitions unfortunately has nothing to do with each other, they are not linked. When you change the location of the table the partitions you had previously added are still associated with the table. Then, when you run the MSCK REPAIR TABLE command, it gets very confused since it's looking at the new table location, finds partitions, and when it cross-checks those with the existing partitions the metadata doesn't match.

The MSCK REPAIR TABLE, makes assumptions like that partitions have locations below your table's location. ALTER TABLE … SET LOCATION … meanwhile, does not consider the assumptions of MSCK REPAIR TABLE and just does exactly what you tell it to, set the location of the table, and does not also rewrite the location of its partitions. One of the reasons it doesn't do this is that it would be wrong thing to do in other circumstances – there is no rule that says that partitions must have locations that are related to the table's location. At least in Athena, the location of a partitioned table is mostly meaningless, but commands like MSCK REPAIR TABLE and Glue Crawlers use it as a hint to where they should start looking for data. The Athena team has also retrofitted a lot of commands from Presto/Hive to run on Glue instead of Hive metastore, which isn't a one-to-one fit, and some things probably just couldn't be done in a way that would work consistently.

You can see what happens if you use the Glue API to query your table before and after you change its location, use the GetPartitions API call to list the partitions of the table and you'll see that they don't change.


Also, although using MSCK REPAIR TABLE like you do works fine, just so you know, it's very, very slow when you get more partitions. Don't rely on it other than to load a few partitions. You can find more information here, here, here, here, and here.

Theo
  • 131,503
  • 21
  • 160
  • 205