
I am trying to execute a query with grouping on 26 columns. The data is stored in S3 in Parquet format, partitioned by day. The Redshift Spectrum query returns the error below, and I am not able to find any relevant AWS documentation about it.

Request ran out of memory in the S3 query layer

  1. Total Number of rows in table : 770 Million
  2. Total size of table in Parquet format : 45 GB
  3. Number of records in each partition : 4.2 Million
  4. Redshift configuration : Single node dc2.xlarge

Attached is the table DDL.

conetfun
  • please update your question with the ddl for the tables you are using, the number of rows approx and the sql that you are running. Also can you try the same/similar in Athena does that work? – Jon Scott Oct 18 '18 at 06:40
  • what query are you running against that? – Jon Scott Oct 18 '18 at 21:09
  • do you actually need a grouping? It's hard to say without seeing your actual query or the data but it seems like your table schema already contains many columns such as event_at_timestamp and click_guid that aggregating using a grouping may be redundant because these values may be unique... – Nathan Griffiths Oct 29 '18 at 22:17

1 Answer


Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
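For illustration, here is what that change looks like in an external table definition. The table and column names below are hypothetical (the question's actual DDL was only attached as an image), but the pattern is the same: replace unsized STRING columns with the smallest VARCHAR that fits the data.

    -- Hypothetical external table: substitute your own names and sizes.
    -- A 36-character GUID needs only VARCHAR(36); an unsized STRING
    -- column may be treated as maximum width, inflating the memory
    -- needed for a wide GROUP BY.
    CREATE EXTERNAL TABLE spectrum.clicks (
        click_guid         VARCHAR(36),   -- was STRING
        event_at_timestamp TIMESTAMP,
        campaign_name      VARCHAR(255)   -- was STRING
    )
    PARTITIONED BY (day DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/clicks/';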

Also, two further suggestions:

  1. I recommend always using at least 2 nodes of Redshift. This gives you a free leader node and allows your compute nodes to use all their RAM for query processing.

  2. Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table consider hashing the columns into a single value and grouping on that. Here's an example:

    SELECT MD5(ws_sold_date_sk
             ||ws_sold_time_sk
             ||ws_ship_date_sk
             ||ws_item_sk
             ||ws_bill_customer_sk
             ||ws_bill_cdemo_sk
             ||ws_bill_hdemo_sk
             ||ws_bill_addr_sk
             ||ws_ship_customer_sk
             ||ws_ship_cdemo_sk
             ||ws_ship_hdemo_sk
             ||ws_ship_addr_sk
             ||ws_web_page_sk
             ||ws_web_site_sk
             ||ws_ship_mode_sk)
         , COUNT(*)
    FROM spectrum.web_sales
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10
    ;
    
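One caveat with the concatenation approach (a hedged variation, not part of the original answer): if any column is NULL, the entire concatenation becomes NULL, and values can collide across column boundaries ('12'||'3' hashes the same as '1'||'23'). Adding a delimiter and COALESCE guards against both; the sketch below shows the pattern on the first three columns and would be extended to all of them:

    -- Variation: delimiter prevents boundary collisions,
    -- COALESCE prevents a single NULL from nulling the whole hash.
    SELECT MD5(COALESCE(ws_sold_date_sk::VARCHAR, '') || '|'
             ||COALESCE(ws_sold_time_sk::VARCHAR, '') || '|'
             ||COALESCE(ws_ship_date_sk::VARCHAR, ''))  -- ...remaining columns
         , COUNT(*)
    FROM spectrum.web_sales
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10
    ;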
Joe Harris
  • hey, great suggestions. I'm just asking this for my understanding: I played with Redshift by allocating 1% memory to a WLM queue and ran a query; it was too slow, but it worked without failing. Generally, if memory is not enough, a database server will use disk (even Redshift does this), so why does Spectrum fail instead of spilling to disk? – TheDataGuy Oct 28 '20 at 12:49