8

I've created a table with auto partitioning with this code on Athena.

CREATE EXTERNAL TABLE IF NOT EXISTS matchdata.stattable (
  `matchResult` string,
  ...
) PARTITIONED BY (
  year int ,
  month int,
  day int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://******/data/year=2019/month=8/day=2/'
TBLPROPERTIES ('has_encrypted_data'='false');

I then ran MSCK REPAIR TABLE stattable, but got "Tables missing on filesystem", and the query returns "Zero records returned". MSCK REPAIR TABLE matchdata.stattable gives the same result.

Queries against another table without partitioning work fine, but as the service continues and the dataset grows, I have to use partitioning.

An example data path is data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz. How can I resolve this issue?

Leta
  • 95
  • 1
  • 3
  • I made a table with location 's3://***/data/' again, but then I got **Partitions not in metastore**. I added a partition manually and tried again, but auto partitioning with MSCK REPAIR still doesn't work. – Leta Aug 07 '19 at 14:31

2 Answers

6

As you've discovered (but with some more context for people having the same issue), MSCK REPAIR TABLE … only understands Hive-style partitioning, e.g. /data/year=2019/month=08/day=10/file.json. What the command actually does is scan the S3 prefix corresponding to the table's LOCATION and look for path components in that form.

This is just a limitation of MSCK REPAIR TABLE …; you can manually add partitions with other path styles like this:

ALTER TABLE the_table ADD PARTITION (year = '2019', month = '08', day = '10') LOCATION 's3://some-bucket/data/2019/08/10/'

Also see https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html

I would go so far as to say that you should avoid using MSCK REPAIR TABLE … altogether. It's slow, and only gets slower the more partitions you have. It's much more efficient to run ALTER TABLE … ADD PARTITION … when you add new data on S3, because you know what you just added and where it is, so telling Athena to scan through your whole prefix is unnecessary. Even faster is using the Glue API directly, but that's more code, unfortunately.
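As a sketch of the Glue API route, here is how registering one partition with boto3 might look. The database, table, and SerDe settings are taken from the question; the bucket name and helper names are placeholders, not part of the original answer:

```python
def partition_input(year, month, day, bucket="some-bucket"):
    """Build a Glue PartitionInput dict for one day of data.

    The Hive-style location and the JSON SerDe match the table in the
    question; the bucket name is an assumed placeholder.
    """
    location = f"s3://{bucket}/data/year={year}/month={month}/day={day}/"
    return {
        "Values": [str(year), str(month), str(day)],
        "StorageDescriptor": {
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
                "Parameters": {"serialization.format": "1"},
            },
        },
    }


def add_partition(year, month, day):
    """Register the partition in the Glue Data Catalog (requires AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_partition(
        DatabaseName="matchdata",
        TableName="stattable",
        PartitionInput=partition_input(year, month, day),
    )
```

Calling add_partition(2019, 8, 2) after uploading a day's data has the same effect as the ALTER TABLE statement above, without a table scan.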

Theo
  • 131,503
  • 21
  • 160
  • 205
0

I solved this problem by renaming the S3 files' prefix.

You can't actually rename or move a file in S3 directly; the mv command copies the object to a new key and deletes the existing one.

By running this command in the console, you can make the partition locations understandable to Hive:

aws s3 mv --recursive s3://***/data/2019/8/7/ s3://***/data/year=2019/month=8/day=7/
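If you have many existing prefixes to rewrite, a small script can compute the Hive-style key for each object and do the copy-and-delete itself. A sketch, assuming the flat layout from the question (data/&lt;year&gt;/&lt;month&gt;/&lt;day&gt;/&lt;file&gt;); the function names are mine, not part of the answer:

```python
def hive_style_key(key):
    """Map a flat key like 'data/2019/8/2/file.gz' to
    'data/year=2019/month=8/day=2/file.gz'."""
    prefix, year, month, day, filename = key.split("/", 4)
    return f"{prefix}/year={year}/month={month}/day={day}/{filename}"


def rename_object(bucket, key):
    """'Move' one object: S3 has no rename, so copy to the new key,
    then delete the old one (requires AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": key},
        Key=hive_style_key(key),
    )
    s3.delete_object(Bucket=bucket, Key=key)
```

This mirrors what aws s3 mv does under the hood, one object at a time.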
Leta
  • 95
  • 1
  • 3