why does AWS Athena needs 'spill-bucket' when it dumps results in target S3 location

Question

WITH
( format = 'Parquet', 
parquet_compression = 'SNAPPY', 
external_location = '**s3://target_bucket_name/my_data**' 
) 
AS
WITH my_data_2 
AS 
    (SELECT * FROM existing_tablegenerated_data" limit 10)
SELECT *
FROM my_data_2;

Since it already has the bucket to store the data , why does Athena need the spill-bucket and what does it store there ?

score 7 · Accepted Answer · answered Feb 24 '21 at 13:33

Trino/Presto developer here who was directly involved in Spill development.

In Trino (formerly known as Presto SQL) the term "spill" refers to dumping on disk data that does not fit into memory. It is an opt-in feature allowing you to process larger queries. Of course, if all your queries require spilling, it's more efficient to simply provision a bigger cluster with more memory, but the functionality is useful when larger queries are rare.

Spilling involves saving temporary data, not the final query results. The spilled data is re-read back and deleted before the query completes execution.

score 2 · Answer 2 · answered Jun 02 '21 at 07:31

Athena uses Lambda functions to connect to External Hive data stores Because of the limit on Lambda function response sizes, responses larger than the threshold spill into an Amazon S3 location that you specify when you create your Lambda function. Athena reads these responses from Amazon S3 directly.

https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html

why does AWS Athena needs 'spill-bucket' when it dumps results in target S3 location

2 Answers2