1

Our data is stored in S3 as JSON without partitions. Until today we were using only athena but now we tried Redshift Spectrum.

We are running the same query twice. Once using Redshift Spectrum and once using Athena. Both connect to the same data in S3.

Using Redshift Spectrum this report takes forever(more than 15 minutes) to run and using Athena it only takes 10 seconds to run.

The query that we are running in both cases in aws console is this:

SELECT "events"."persistentid" AS "persistentid",
  SUM(1) AS "sum_number_of_reco"
FROM "analytics"."events" "events"
GROUP BY "events"."persistentid"

Any idea what's going on? Thanks

AJ222
  • 1,114
  • 2
  • 18
  • 40
  • This is the difference between serverless and your servers. Redshift spectrum is using your Redshift cluster, which you can resize as needed, but it is probably smaller than the fleet of Athena that was assigned to your query. – Guy Nov 28 '19 at 12:27
  • The guys from AWS support say it's happening because we have many small files (as we are using Kineses Firehose and it is creating a file in s3 every 5 minutes)... – AJ222 Nov 28 '19 at 13:04

1 Answers1

5

The Redshift Spectrum processing power is limited by Redshift cluster size.

You can find the infomation from Improving Amazon Redshift Spectrum Query Performance

The Amazon Redshift query planner pushes predicates and aggregations to the Redshift Spectrum query layer whenever possible. When large amounts of data are returned from Amazon S3, the processing is limited by your cluster's resources. Redshift Spectrum scales automatically to process large requests. Thus, your overall performance improves whenever you can push processing to the Redshift Spectrum layer.

On the other hand, Athena uses optimized amount of resource for the query, which may be larger than the Spectrum of a small Redshift cluster can get.

This has been confirmed by our testing on Redshift Spectrum performance with different Redshift cluster size.

  • Note that the core processing of Redshift Spectrum operates at an independent layer than Redshift and is *not* affected by the Redshift cluster. `Improving Amazon S3 query performance with predicate pushdown The processing that is done in the Amazon Redshift Spectrum layer (the Amazon S3 scan, projection, filtering, and aggregation) is independent from any individual Amazon Redshift cluster.` https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/ – Gabe Aug 09 '21 at 19:26