I am facing the problem of frequent Disk Full error
on Redshift Spectrum, as a result, I have to repeatedly scale up the cluster. It seems that the caching would be deleted.
Ideally, I would like the scaling up to keep the caching, and finding a way to know how much disk space would be needed in a query.
Is there any document out there that talks about the caching of Redshift Spectrum, or they are using the same mechanism to Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
SUM(p.like_count),
COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
ON l.postcode = p.postcode
GROUP BY 1;
The total of zipped data on S3 is about 1.8 TB. Athena took 10 minutes, scanned 700 GBs and told me Query exhausted resources at this scale factor
EDIT 2: I used a 16 TB SSD cluster.