
I am trying to execute a query with grouping on 26 columns. The data is stored in S3 in Parquet format, partitioned by day. The Redshift Spectrum query returns the error below, and I am not able to find any relevant AWS documentation about it.

Request ran out of memory in the S3 query layer

  1. Total Number of rows in table : 770 Million
  2. Total size of table in Parquet format : 45 GB
  3. Number of records in each partition : 4.2 Million
  4. Redshift configuration : Single node dc2.xlarge

Attached is the table DDL.

conetfun
  • please update your question with the ddl for the tables you are using, the number of rows approx and the sql that you are running. Also can you try the same/similar in Athena does that work? – Jon Scott Oct 18 '18 at 06:40
  • what query are you running against that? – Jon Scott Oct 18 '18 at 21:09
  • do you actually need a grouping? It's hard to say without seeing your actual query or the data but it seems like your table schema already contains many columns such as event_at_timestamp and click_guid that aggregating using a grouping may be redundant because these values may be unique... – Nathan Griffiths Oct 29 '18 at 22:17

1 Answer


Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
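For illustration, here is what that change looks like in an external table definition. The table and column names below are hypothetical (the question's actual DDL was only attached as an image), but the pattern is the same: replace unsized STRING columns with the smallest VARCHAR that fits the data.

    -- Hypothetical external table: substitute your own names and sizes.
    -- A 36-character GUID needs only VARCHAR(36); an unsized STRING
    -- column may be treated as maximum width, inflating the memory
    -- needed for a wide GROUP BY.
    CREATE EXTERNAL TABLE spectrum.clicks (
        click_guid         VARCHAR(36),   -- was STRING
        event_at_timestamp TIMESTAMP,
        campaign_name      VARCHAR(255)   -- was STRING
    )
    PARTITIONED BY (day DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/clicks/';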

Also, two further suggestions:

  1. I recommend always using at least 2 nodes of Redshift. This gives you a free leader node and allows your compute nodes to use all their RAM for query processing.

  2. Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table consider hashing the columns into a single value and grouping on that. Here's an example:

    SELECT MD5(ws_sold_date_sk
             ||ws_sold_time_sk
             ||ws_ship_date_sk
             ||ws_item_sk
             ||ws_bill_customer_sk
             ||ws_bill_cdemo_sk
             ||ws_bill_hdemo_sk
             ||ws_bill_addr_sk
             ||ws_ship_customer_sk
             ||ws_ship_cdemo_sk
             ||ws_ship_hdemo_sk
             ||ws_ship_addr_sk
             ||ws_web_page_sk
             ||ws_web_site_sk
             ||ws_ship_mode_sk)
         , COUNT(*)
    FROM spectrum.web_sales
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10
    ;
    
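One caveat with the concatenation approach (a hedged variation, not part of the original answer): if any column is NULL, the entire concatenation becomes NULL, and values can collide across column boundaries ('12'||'3' hashes the same as '1'||'23'). Adding a delimiter and COALESCE guards against both; the sketch below shows the pattern on the first three columns and would be extended to all of them:

    -- Variation: delimiter prevents boundary collisions,
    -- COALESCE prevents a single NULL from nulling the whole hash.
    SELECT MD5(COALESCE(ws_sold_date_sk::VARCHAR, '') || '|'
             ||COALESCE(ws_sold_time_sk::VARCHAR, '') || '|'
             ||COALESCE(ws_ship_date_sk::VARCHAR, ''))  -- ...remaining columns
         , COUNT(*)
    FROM spectrum.web_sales
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10
    ;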
Joe Harris
  • hey, great suggestions. I'm just asking this for my understanding: I played with Redshift by allocating 1% memory to a WLM queue and ran a query; it was too slow, but it worked without failing. Generally, if memory is not enough, a database server will use disk (even Redshift does this), so why does Spectrum fail instead of spilling to disk? – TheDataGuy Oct 28 '20 at 12:49