
I have to select more than 1.9 billion rows. I am trying to query a table hosted in a database in the AWS Athena console. The table reads Parquet files from an S3 bucket.

When I run this query:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

My query seems to time out, as there are 1.9 billion rows returned when I run a COUNT on it.

I tried OFFSET along with LIMIT, but that doesn't seem to work in AWS Athena.

I also tried something along the lines of:

SELECT * FROM ids WHERE org = 'abcd' AND idkey = 'email-md5' LIMIT 0,500;

This doesn't seem to work either.

I'm not sure how to chunk through such a large dataset with SELECT.

The aim here is to be able to query the entire dataset without having the query time out.

I ran a COUNT:

SELECT COUNT(*) FROM ids WHERE org = 'abcd' AND idkey = 'email-md5';

The COUNT returned is 1.9 billion, as mentioned above. I need to pull all 1.9 billion rows so that I can download them and do further analysis.

Raj
  • What is the purpose of extracting such a huge dataset? – GMB Dec 04 '19 at 23:10
  • Could you please edit your question to include details about how your data is stored? For example, what is the format of the files (CSV, Parquet, JSON)? Are they compressed? Are you using partitions? Are you in a situation where you can change some of these factors to make the queries more efficient (faster & cheaper)? – John Rotenstein Dec 04 '19 at 23:11
  • *How* are you running that query - from some code? What language/platform are you using? Are you using the Athena JDBC driver or some other connector? You need to include more information about what you're doing, but from what you've said so far I don't think SQL is going to help you - you'll possibly need to run a series of SQL statements in a loop and dynamically add filters to your query to return subsets of the data that your code/driver/network can handle. – Nathan Griffiths Dec 04 '19 at 23:32
  • @GMB Doing some data quality testing on our data warehouse, so I need to look at the entire dataset. Athena is used to query data from the data warehouse. – Raj Dec 05 '19 at 02:03
  • Hey @JohnRotenstein These are Parquet files that are being read from a bucket in S3. I cannot mess around with the table as it is linked to our daily ETL process. – Raj Dec 05 '19 at 02:11
  • Hi @NathanGriffiths I am using the AWS Athena console to query this table. Thank you all for reviewing the question. I am learning and just added/edited to make my question more clear. – Raj Dec 05 '19 at 02:14
  • Ok that's clearer now - how are you planning to further analyse the data, assuming you were able to obtain a file (or files) containing all 1.9 billion rows? e.g. are you going to do some analysis with python or R? – Nathan Griffiths Dec 05 '19 at 02:45
  • @Raj In most cases the query time-out happens because of the file sizes. Can you confirm what the average size of each object in S3 is? – Prabhakar Reddy Dec 05 '19 at 04:55

1 Answer


It appears that your situation is:

  • A daily ETL process delivers new Parquet files
  • One table has 1.9 billion rows
  • Queries are timing out in Athena

It would appear that your issue is related to Athena having to query so much data. Some ways to improve the efficiency (and cost) of Athena queries are:

  • Use columnar-format files (you are using Parquet, so that is great!)
  • Compress the files (less to read from disk means it is faster and costs less for queries)
  • Partition the files (which allows Athena to totally skip files that aren't relevant)

The simplest one for your situation would probably be to start partitioning the data by putting the daily files into separate directories based upon something that is normally included in the WHERE clause. This would normally be a date, which is easy to partition on (e.g. a different directory per day or month), but that might not be relevant given that you filter on org and idkey.
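For example, a date-partitioned version of the table could be defined along these lines. This is only a sketch: the bucket path, column names and dt= prefix are assumptions about your layout, not values from your question.

-- Illustrative only: a date-partitioned external table over the Parquet files
CREATE EXTERNAL TABLE ids_partitioned (
  org     string,
  idkey   string,
  idvalue string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/ids/';

-- Register the partitions already present under the LOCATION prefix
MSCK REPAIR TABLE ids_partitioned;

-- A filter on the partition column lets Athena scan only the matching prefixes
SELECT *
FROM ids_partitioned
WHERE org = 'abcd'
  AND idkey = 'email-md5'
  AND dt BETWEEN '2019-11-01' AND '2019-11-30';

Querying one partition (e.g. one day) at a time is also a natural way to chunk the extract, so that no single query has to return all 1.9 billion rows.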

Another option would be to transform the incoming files into a new table containing just the relevant data. For example, you could create a table with a summary of the rows, such as one containing org, idkey and a count of the matching rows. That way, many raw rows are reduced to a single row in the computed table. This requires a better knowledge of the content of the files and how you intend to query them, but it would optimize those queries. Basically, you would process each day's new files into the computed table, then run queries against the computed table rather than the raw data. (This is commonly known as an ETL process.)
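As a sketch of that idea, Athena's CREATE TABLE AS SELECT (CTAS) can materialise such a summary as Parquet in S3. The table name, output location and summarised columns below are illustrative assumptions.

-- Illustrative CTAS: write a per-(org, idkey) summary as Parquet
CREATE TABLE ids_summary
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/ids_summary/'
) AS
SELECT org,
       idkey,
       COUNT(*) AS row_count
FROM ids
GROUP BY org, idkey;

-- Later checks read the small summary table instead of 1.9 billion raw rows
SELECT row_count
FROM ids_summary
WHERE org = 'abcd'
  AND idkey = 'email-md5';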

A final suggestion would be to import the data into Amazon Redshift. It can handle billions of rows quite easily and can store the data in a compressed, optimized manner. This is only worthwhile if you run lots of queries against the data; if you only run a few queries a day, Athena is the better choice.
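If you do go the Redshift route, Parquet files are typically loaded with a COPY command roughly like the following. The table definition, bucket and IAM role ARN are placeholders, not values from your setup.

-- Placeholder table and role; adjust columns and types to match your data
CREATE TABLE ids (
  org     VARCHAR(64),
  idkey   VARCHAR(64),
  idvalue VARCHAR(256)
);

COPY ids
FROM 's3://my-bucket/ids/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;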

John Rotenstein