I am migrating some tables and want to re-bucket each daily partition into exactly 1024 buckets. However, on some days there are more than 1024 files in S3, and when I open the extra files, which are all identical and significantly smaller than the other 1024 files, I only see some text (the schema and some other metadata).
How did those files get generated (or escape deletion)? What can I do to prevent this from happening?
Here are the relevant code and the S3 listing of the files; hopefully this helps explain my problem.
CREATE EXTERNAL TABLE IF NOT EXISTS {{params.table_schema}}.{{params.table_name}}
LIKE default.event_xxx_table
LOCATION '{{macros.xxx.s3_location(params.table_schema, params.table_name)}}';
ALTER TABLE {{params.table_schema}}.{{params.table_name}}
CLUSTERED BY (user_id) SORTED BY (occurred_at) INTO 1024 BUCKETS;
-- Force the number of reducers to match the declared bucket count
SET hive.enforce.bucketing=true;
SET hive.mapred.mode=nonstrict;
-- Let the INSERT create the ds partitions dynamically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.execution.engine=tez;
INSERT OVERWRITE TABLE {{params.table_schema}}.{{params.table_name}} PARTITION (ds)
SELECT *
FROM default.event_xxx_table
WHERE ds = '{date}';
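For reference, after the ALTER I sanity-check that the bucket spec actually landed in the table metadata. The commented lines below are what I would expect DESCRIBE FORMATTED to report, not output captured from this cluster:
DESCRIBE FORMATTED {{params.table_schema}}.{{params.table_name}};
-- Expected among the storage information:
--   Num Buckets:          1024
--   Bucket Columns:       [user_id]
--   Sort Columns:         [Order(col:occurred_at, order:1)]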
2019-08-28 06:25:24 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001017
2019-08-28 06:25:24 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001018
2019-08-28 06:25:24 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001019
2019-08-28 06:25:25 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001020
2019-08-28 06:25:25 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001021
2019-08-28 06:25:25 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001022
2019-08-28 06:25:26 729 ed36387ffa4f4d3da4005fe7d21ec1c9_001023
2019-08-28 06:12:28 1017766687 edab90777b2b4146b1dba09ffeed5f99_000000
2019-08-28 06:11:52 1034071139 edab90777b2b4146b1dba09ffeed5f99_000001
2019-08-28 06:12:09 1127037910 edab90777b2b4146b1dba09ffeed5f99_000002
2019-08-28 06:11:41 1108365317 edab90777b2b4146b1dba09ffeed5f99_000003
2019-08-28 06:12:12 1177903335 edab90777b2b4146b1dba09ffeed5f99_000004
2019-08-28 06:12:05 1098575486 edab90777b2b4146b1dba09ffeed5f99_000005
2019-08-28 06:11:46 1114437323 edab90777b2b4146b1dba09ffeed5f99_000006
2019-08-28 06:11:30 1083373949 edab90777b2b4146b1dba09ffeed5f99_000007
2019-08-28 06:12:18 1177453908 edab90777b2b4146b1dba09ffeed5f99_000008
2019-08-28 06:12:25 1258655862 edab90777b2b4146b1dba09ffeed5f99_000009
2019-08-28 06:12:03 1068495135 edab90777b2b4146b1dba09ffeed5f99_000010
2019-08-28 06:12:04 1144533991 edab90777b2b4146b1dba09ffeed5f99_000011
2019-08-28 06:12:08 1198343555 edab90777b2b4146b1dba09ffeed5f99_000012
...
2019-08-28 06:12:08 1198343555 edab90777b2b4146b1dba09ffeed5f99_001024
When querying the new table with Presto, I got this error:
Query failed (#20190828_212540_16755_g3z3r): Hive table 'xxx.xxxx_xxx_bucketed' is corrupt. The number of files in the directory (1035) does not match the declared bucket count (1024) for partition: ds=2019-08-03
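For what it's worth, I can cross-check from the Hive side by counting only the files that actually contain rows. This is a rough sketch using Hive's INPUT__FILE__NAME virtual column; row-less files, like the small 729-byte ones in the listing above, carry no rows and so will not appear in this count, which is presumably why it can disagree with the raw S3 listing:
SELECT ds, count(DISTINCT INPUT__FILE__NAME) AS files_with_rows
FROM {{params.table_schema}}.{{params.table_name}}
WHERE ds = '2019-08-03'
GROUP BY ds;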