0

I have created a table on AWS Athena. It is partitioned both on S3, and Athena. I am now trying to load the table into a pandas dataframe using 2 methods from the awswrangler library: AWS Athena read_sql_query vs reading parquet directly as below:

start = time.time()
df = wr.athena.read_sql_query(sql="SELECT * FROM rumorz_db.ohlcv_1d_partition where coin='ADA'",
                              database="rumorz_db",
                              boto3_session=s3.session)
print(time.time() - start)
start = time.time()
current_data = parquet_util.readParquet(
    file_path=s3.getObjectURI(RUMORZ_DATA_S3_BUCKET, f"market_data/1d_ohlcv_raw.parquet/"),
    partition_filter=lambda x: x["coin"] == 'ADA'
)
print(time.time() - start) 

The AWS Athena method takes 6 seconds, while read_parquet takes 2 seconds. I thought Athena was faster than reading parquet directly, not 3 times slower. Is this expected?

  • 2
    How big is your data? Athena still has to spin up resources to handle your files. No sense using it for things that you can query directly or with S3 select. – Marcin Jul 29 '22 at 02:59
  • This specific data is 30mb so very small - was just using it for testing. Do you mean that of those 6 seconds, maybe 4 seconds were for spinning up resources, and only 2 for query handling? AKA, with larger sizes the Athena query handling efficiency will be more evident? (the 4 seconds of spin up will be very small compared to query handling time). – survival_law00 Jul 29 '22 at 13:03

0 Answers0