Getting a "Disk Full" error from Redshift Spectrum

Question

I keep getting Disk Full errors on Redshift Spectrum, so I have to scale up the cluster repeatedly. It seems that the cache is deleted each time the cluster is scaled.

Ideally, I would like scaling up to preserve the cache, and I would like a way to estimate how much disk space a query will need.

Is there any documentation that describes caching in Redshift Spectrum, or does it use the same mechanism as Redshift?

EDIT: As requested by Jon Scott, I am updating my question

SELECT p.postcode,
       SUM(p.like_count),
       COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
    ON l.postcode = p.postcode
GROUP BY 1;

The compressed data on S3 totals about 1.8 TB. Athena ran for 10 minutes, scanned 700 GB, and then failed with "Query exhausted resources at this scale factor".

EDIT 2: I used a 16 TB SSD cluster.

I do not think there is any caching in Redshift Spectrum (unless you are managing that yourself somehow). Can you elaborate on your use case and provide examples? What tools are you using? — Jon Scott, Jun 26 '19 at 12:36
I have two S3 folders containing xz-compressed JSON files, mapped to two `external table`s. The compressed data totals about 1.8 TB. I select two columns, one from each table, and apply a `SUM` function. I am using pure Redshift Spectrum. — Minh Triet, Jun 26 '19 at 14:42
Please provide the exact SQL you are using, then try the same query in Athena and come back with the results. (Please edit your question with this info.) — Jon Scott, Jun 26 '19 at 15:04

Joe Harris · Accepted Answer · 2019-07-01T14:07:41.970

You did not mention the size of the Redshift cluster you are using but the simple answer is to use a larger Redshift cluster (more nodes) or use a larger node type (more disk per node).

The issue is occurring because Redshift Spectrum is not able to push the full join execution down to the Spectrum layer. A majority of the data is being returned to the Redshift cluster simply to execute the join.
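One way to check how much of a query runs in the Spectrum layer is to look at its plan. The sketch below simply runs EXPLAIN on the query from the question. In Redshift plans, steps such as S3 Seq Scan and S3 HashAggregate are executed by Spectrum, while the remaining steps, including the hash join, run on the cluster nodes. The exact plan depends on table statistics and cluster configuration, so treat this as a diagnostic aid rather than a guaranteed outcome.

```sql
-- Show the plan without executing the query.
-- Look for S3-prefixed steps (Spectrum layer) versus the join step (cluster).
EXPLAIN
SELECT p.postcode,
       SUM(p.like_count),
       COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
    ON l.postcode = p.postcode
GROUP BY 1;
```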

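You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This is most effective when each subquery outputs far fewer rows than the unaggregated tables would otherwise send to the cluster for the join. Note that pre-aggregating is not strictly equivalent to the original query: when a postcode has several rows in both tables, the original join multiplies the rows before summing and counting, whereas this version aggregates each table on its own.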

SELECT p.postcode
     , p.like_count
     , l.like_ids
FROM (--Summarize post data
      SELECT p.postcode
           , SUM(p.like_count) AS like_count  -- alias needed by the outer SELECT
      FROM post AS p
      GROUP BY 1
     ) AS p
INNER JOIN (--Summarize likes data
            SELECT l.postcode
                 , COUNT(l.id) AS like_ids
            FROM likes AS l
            GROUP BY 1
          ) AS l
    -- Join pre-summarized data only
    ON l.postcode = p.postcode
;
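To check whether a rewrite like this actually reduces the data handed back to the cluster, the system view SVL_S3QUERY_SUMMARY can be queried afterwards. The following is a minimal sketch, assuming the Spectrum query was the last query run in the current session. The second statement reads STV_PARTITIONS, which is visible only to superusers, to show how much disk space is currently free.

```sql
-- Per-segment Spectrum statistics for the most recent query in this session:
-- bytes scanned on S3 versus bytes returned to the Redshift cluster.
SELECT query,
       segment,
       elapsed,
       s3_scanned_rows,
       s3_scanned_bytes,
       s3query_returned_rows,
       s3query_returned_bytes,
       files
FROM svl_s3query_summary
WHERE query = pg_last_query_id()
ORDER BY query, segment;

-- Cluster-wide disk capacity and usage (values are in 1 MB blocks, shown here in GB).
-- Requires superuser access.
SELECT SUM(capacity) / 1024                 AS capacity_gbytes,
       SUM(used) / 1024                     AS used_gbytes,
       (SUM(capacity) - SUM(used)) / 1024   AS free_gbytes
FROM stv_partitions
WHERE part_begin = 0;
```

Comparing s3query_returned_bytes with the free space gives only a rough indication of headroom, not an exact estimate, because intermediate join results on the cluster can be larger than the data returned from Spectrum.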

I also learnt that one should create a table that contains only the data needed, not a table that has everything in the JSON — Minh Triet, Jul 01 '19 at 07:50
