
When I run the following query on AWS Athena against a 63GB Trades.csv file:

SELECT * FROM Trades WHERE TraderID = 1234567

It takes 6.81 seconds and scans 63.82GB in the process (almost exactly the size of the Trades.csv file, so it is doing a full table scan). That works out to roughly 9.4GB/s of aggregate read throughput.

What shocks me is the sheer speed of the data being drawn from S3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and enormous S3 read bandwidth to get around the lack of indexing (whereas on a standard SQL DB you would have an index on TraderID and load millions of times less data).

But in my experiments I only managed to get the following read rates from S3 (which are still impressive):

Instance Type    MB/s   Network (Gbit/s)
t2.2xlarge       113    low
t3.2xlarge       140    up to 5
c5n.2xlarge      160    up to 25
c6gn.16xlarge    230    100

(that's megabytes rather than megabits)
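
For context, the numbers above came from timing plain downloads (see the comments below). A minimal boto3 sketch of an equivalent single-stream measurement would look something like this; the bucket and key names are placeholders, not from the original post:

import time
import boto3

# Placeholder bucket/object; substitute a large test file in the same
# region as the instance (eu-west-1 here).
BUCKET = "my-test-bucket"
KEY = "Trades.csv"

s3 = boto3.client("s3", region_name="eu-west-1")

start = time.monotonic()
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]
total = 0
for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):  # 8MB reads
    total += len(chunk)  # discard the bytes; we only want throughput
elapsed = time.monotonic() - start

print(f"{total / elapsed / 1e6:.0f} MB/s over a single GET stream")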

I'm using an internal VPC endpoint for S3 in eu-west-1. Does anyone have any tricks or tips for getting S3 reads to go faster? Has anyone achieved over 1GB/s read speeds from S3? Is that even possible?

Nick
  • Are you using the AWS CLI to test speed? While fast for many purposes, it's not the fastest possible approach. For instance, several processes all downloading a portion of the CSV using range requests and processing the data in memory would work through the entire file faster than downloading it with the AWS CLI and then processing the file on EBS. – Anon Coward Mar 16 '21 at 18:59
  • At the moment I'm just trying to download a 10GB file using the AWS CLI, without trying to process it at all. On further inspection though, it looks like the AWS CLI is not very well multithreaded (a sketch of the range-request approach follows below). – Nick Mar 17 '21 at 09:11
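
To illustrate the comment above: a minimal sketch of the parallel range-request approach using boto3 threads instead of the CLI. The bucket, key, part size, and worker count are all placeholders/assumptions to be tuned for the instance:

import time
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "my-test-bucket"   # placeholder
KEY = "Trades.csv"          # placeholder
PART = 64 * 1024 * 1024     # 64MB per ranged GET (tunable)
WORKERS = 32                # parallel streams (tunable)

s3 = boto3.client("s3", region_name="eu-west-1")  # boto3 clients are thread-safe
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch(offset):
    # Issue a ranged GET for [offset, offset + PART) and discard the bytes.
    end = min(offset + PART, size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{end}")["Body"]
    return sum(len(c) for c in body.iter_chunks(chunk_size=1024 * 1024))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(fetch, range(0, size, PART)))
elapsed = time.monotonic() - start

print(f"{total / elapsed / 1e6:.0f} MB/s across {WORKERS} ranged streams")

On a large instance, raising WORKERS until the NIC saturates is usually what gets past the single-stream ceiling.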

1 Answer


It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM

No, it's more like many small boxes, not a single massive box. Athena runs your query in parallel, on multiple servers at once. The exact details of that are not published anywhere as far as I am aware, but the documentation makes it very clear that queries run in parallel. The numbers in your question point the same way: no single instance you tested got past 230MB/s, which is nowhere near enough to scan 63.82GB in 6.81 seconds.
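
You can't see the fleet behind a query, but you can at least confirm the scan size and engine time per query from the API. A small boto3 sketch; the execution ID is a placeholder for whatever Athena returned when the query was started:

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Placeholder: the QueryExecutionId returned when the SELECT was started.
QUERY_ID = "00000000-0000-0000-0000-000000000000"

resp = athena.get_query_execution(QueryExecutionId=QUERY_ID)
stats = resp["QueryExecution"]["Statistics"]
print("bytes scanned:", stats["DataScannedInBytes"])
print("engine time (ms):", stats["EngineExecutionTimeInMillis"])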

Mark B
  • I think we worked out that it can't be many boxes, because large joins break when you load too many rows: the box runs out of RAM. I would also have thought there would be too much network overhead to collate the results in 6 seconds. – Nick Mar 17 '21 at 08:53