
When I run the following query on AWS Athena against a 63GB Trades.csv file:

SELECT * FROM Trades WHERE TraderID = 1234567

It takes 6.81 seconds and scans 63.82GB in the process (almost exactly the size of the Trades.csv file, so it is doing a full table scan). That works out to roughly 9.4GB/s of aggregate read throughput.

What shocks me is the sheer speed of the data being drawn from S3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and enormous S3 read bandwidth to get around the lack of indexing (whereas on a standard SQL DB you would have an index on TraderID and load millions of times less data).

But in my experiments I only managed to get the following read rates from S3 (which are still impressive):

Instance Type    MB/s   Network (Gbit/s)
t2.2xlarge       113    low
t3.2xlarge       140    up to 5
c5n.2xlarge      160    up to 25
c6gn.16xlarge    230    100

(that's megabytes rather than megabits)
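
For context, the numbers above came from timing plain downloads (see the comments below). A minimal boto3 sketch of an equivalent single-stream measurement would look something like this; the bucket and key names are placeholders, not from the original post:

import time
import boto3

# Placeholder bucket/object; substitute a large test file in the same
# region as the instance (eu-west-1 here).
BUCKET = "my-test-bucket"
KEY = "Trades.csv"

s3 = boto3.client("s3", region_name="eu-west-1")

start = time.monotonic()
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]
total = 0
for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):  # 8MB reads
    total += len(chunk)  # discard the bytes; we only want throughput
elapsed = time.monotonic() - start

print(f"{total / elapsed / 1e6:.0f} MB/s over a single GET stream")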

I'm using an internal VPC endpoint for S3 in eu-west-1. Does anyone have any tricks or tips for getting S3 reads to go faster? Has anyone achieved over 1GB/s read speeds from S3? Is that even possible?

Nick
  • Are you using the AWS CLI to test speed? While fast for many purposes, it's not the fastest possible approach. For instance, several processes all downloading a portion of the CSV using range requests and processing the data in memory would work through the entire file faster than downloading it with the AWS CLI and then processing the file on EBS. – Anon Coward Mar 16 '21 at 18:59
  • At the moment I'm just trying to download a 10GB file using the AWS CLI, without trying to process it at all. On further inspection though, it looks like the AWS CLI is not very well multithreaded (a sketch of the range-request approach follows below). – Nick Mar 17 '21 at 09:11
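
To illustrate the comment above: a minimal sketch of the parallel range-request approach using boto3 threads instead of the CLI. The bucket, key, part size, and worker count are all placeholders/assumptions to be tuned for the instance:

import time
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "my-test-bucket"   # placeholder
KEY = "Trades.csv"          # placeholder
PART = 64 * 1024 * 1024     # 64MB per ranged GET (tunable)
WORKERS = 32                # parallel streams (tunable)

s3 = boto3.client("s3", region_name="eu-west-1")  # boto3 clients are thread-safe
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch(offset):
    # Issue a ranged GET for [offset, offset + PART) and discard the bytes.
    end = min(offset + PART, size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{end}")["Body"]
    return sum(len(c) for c in body.iter_chunks(chunk_size=1024 * 1024))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(fetch, range(0, size, PART)))
elapsed = time.monotonic() - start

print(f"{total / elapsed / 1e6:.0f} MB/s across {WORKERS} ranged streams")

On a large instance, raising WORKERS until the NIC saturates is usually what gets past the single-stream ceiling.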

1 Answer


It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM

No, it's more like many small boxes, not a single massive box. Athena runs your query in parallel, on multiple servers at once. The exact details of that are not published anywhere as far as I am aware, but the documentation makes it very clear that queries run in parallel. The numbers in your question point the same way: no single instance you tested got past 230MB/s, which is nowhere near enough to scan 63.82GB in 6.81 seconds.
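
You can't see the fleet behind a query, but you can at least confirm the scan size and engine time per query from the API. A small boto3 sketch; the execution ID is a placeholder for whatever Athena returned when the query was started:

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Placeholder: the QueryExecutionId returned when the SELECT was started.
QUERY_ID = "00000000-0000-0000-0000-000000000000"

resp = athena.get_query_execution(QueryExecutionId=QUERY_ID)
stats = resp["QueryExecution"]["Statistics"]
print("bytes scanned:", stats["DataScannedInBytes"])
print("engine time (ms):", stats["EngineExecutionTimeInMillis"])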

Mark B
  • I think we worked out that it can't be many boxes, because large joins break when you load too many rows: the box runs out of RAM. I would also have thought there would be too much network overhead to collate the results in 6 seconds. – Nick Mar 17 '21 at 08:53