
I am trying to create an Athena Table that makes use of both Projected Partitioning and Bucketing (CLUSTERED BY). I'm doing this to get a side-by-side performance comparison for our dataset with and without Bucketing. From my tests, this combination does not seem to be supported, but I cannot find anything in the documentation that explicitly states this, so I'm assuming that I'm missing something. Bucketing works with normal Partitioning, but I'm trying to use Projected Partitioning so that I do not have to maintain the Partitions in the Glue Catalog.
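
To illustrate what I'm trying to avoid: with normal Partitioning, every new partition has to be registered in the Glue Catalog before Athena will read it. A rough sketch of that maintenance step, using the same placeholder names as below (the partition value 42 is made up):

import awswrangler

# Without Partition Projection, each new partition must be added to the
# Glue Catalog explicitly, e.g. one call like this per new partition value.
awswrangler.catalog.add_parquet_partitions(
    database="database_name",
    table="table_name",
    partitions_values={
        # S3 prefix of the partition -> list of partition values (placeholders)
        "s3://bucket/table_prefix/partition_column=42/": ["42"],
    },
)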

This is my setup. I have an existing Athena Table that reads Gzipped Parquet files on S3; this all works. To create the Bucketed version of my Table(s), I use Athena CTAS to produce Bucketed, Gzipped Parquet files. The CTAS files are written to a staging location, and I then move them to a location that fits my storage structure. Finally, I try to create a new Table that points to the bucketed data, with both Projected Partitioning and Bucketing enabled in the Table definition. I've used both Athena SQL and AWS Wrangler's create_parquet_table to do this.

Here is the original CTAS SQL that creates the Bucketed Files:

CREATE TABLE database_name.ctas_table_name
WITH (
   external_location = 's3://bucket/staging_prefix',
   partitioned_by = ARRAY['partition_column'],
   bucketed_by = ARRAY['index_column'],
   bucket_count = 10,
   format = 'PARQUET'
)
AS
SELECT 
   index_column, 
   partition_column
FROM database_name.table_name;
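
For completeness, the CTAS can also be submitted from Python; below is a rough sketch using awswrangler (the SQL string is just the statement above):

import awswrangler

# The CTAS statement shown above, pasted verbatim.
ctas_sql = """
CREATE TABLE database_name.ctas_table_name
WITH (
   external_location = 's3://bucket/staging_prefix',
   partitioned_by = ARRAY['partition_column'],
   bucketed_by = ARRAY['index_column'],
   bucket_count = 10,
   format = 'PARQUET'
)
AS
SELECT index_column, partition_column
FROM database_name.table_name
"""

# Submit the query to Athena and wait for it to finish.
response = awswrangler.athena.start_query_execution(
    sql=ctas_sql,
    database="database_name",
    wait=True,
)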

The files produced by the above CTAS are then moved from the staging location to the actual location, let's call it 's3://bucket/table_prefix'. This results in an S3 structure like:

s3://bucket/table_prefix/partition_column=xx/file.bucket_000.gzip.parquet
s3://bucket/table_prefix/partition_column=xx/file.bucket_001.gzip.parquet
...
s3://bucket/table_prefix/partition_column=xx/file.bucket_009.gzip.parquet

So there are 10 bucketed files per partition.
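
The move itself is just a per-object copy and delete on S3; one way to do it, as a rough boto3 sketch with the placeholder bucket/prefix names from above:

import boto3

s3 = boto3.client("s3")
bucket = "bucket"

# Copy every object from the CTAS staging prefix into the final table prefix,
# keeping the partition_column=xx/ sub-structure, then delete the staged copy.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="staging_prefix/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        new_key = key.replace("staging_prefix/", "table_prefix/", 1)
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)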

Then this is the SQL to create the Table on Athena:

CREATE EXTERNAL TABLE database_name.table_name (
   index_column bigint
)
PARTITIONED BY (
   partition_column bigint
)
CLUSTERED BY (index_column) INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/table_prefix'
TBLPROPERTIES (
  'classification'='parquet', 
  'compressionType'='gzip', 
  'projection.enabled'='true',
  'projection.partition_column.type'='integer',
  'projection.partition_column.range'='0,3650',
  'typeOfData'='file'
);

If I submit this last CREATE TABLE SQL, it succeeds. However, when selecting from the table, I get the following error message:

HIVE_INVALID_BUCKET_FILES: Hive table 'database_name.table_name' is corrupt. Found sub-directory in bucket directory for partition: <UNPARTITIONED>

If I try to create the Table using the aforementioned awswrangler.catalog.create_parquet_table, which looks like this:

import awswrangler
import boto3

boto3_session = boto3.Session()

response = awswrangler.catalog.create_parquet_table(
  boto3_session       = boto3_session,
  database            = 'database_name',
  table               = 'table_name',
  path                = 's3://bucket/table_prefix',
  partitions_types    = {"partition_column": "bigint"},
  columns_types       = {"index_column": "bigint", "partition_column": "bigint"},
  bucketing_info      = (["index_column"], 10),
  compression         = 'gzip',
  projection_enabled  = True,
  projection_types    = {"partition_column": "integer"},
  projection_ranges   = {"partition_column": "0,3650"},
)

This API call raises the following exception:

awswrangler.exceptions.InvalidArgumentCombination: Column index_column appears as projected column but not as partitioned column.

This does not make sense, as the partition column is clearly declared there. I believe this error to be a red herring in any case: if I remove the bucketing_info parameter, it all works. Conversely, if I remove the projection_* parameters, it all works.
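
For reference, this is the same call with bucketing_info removed, which succeeds as described above (keeping bucketing_info and instead dropping the projection_* parameters also succeeds):

import awswrangler
import boto3

boto3_session = boto3.Session()

# Identical to the failing call, just without bucketing_info -- this one works.
response = awswrangler.catalog.create_parquet_table(
  boto3_session       = boto3_session,
  database            = 'database_name',
  table               = 'table_name',
  path                = 's3://bucket/table_prefix',
  partitions_types    = {"partition_column": "bigint"},
  columns_types       = {"index_column": "bigint", "partition_column": "bigint"},
  compression         = 'gzip',
  projection_enabled  = True,
  projection_types    = {"partition_column": "integer"},
  projection_ranges   = {"partition_column": "0,3650"},
)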

So from what I can gather, Partition Projection is not compatible with Bucketing. But this is not made clear in the documentation, nor could I find anything online to confirm it. So I'm asking here: does anyone know what is going on?

  • Did I make a mistake in my setup?
  • Did I miss a piece of AWS Athena documentation that states this is not possible?
  • Or is this an undocumented incompatibility?
  • Or... aliens??
