I have created a table on AWS Athena. It is partitioned both on S3, and Athena. I am now trying to load the table into a pandas dataframe using 2 methods from the awswrangler library: AWS Athena read_sql_query vs reading parquet directly as below:
start = time.time()
df = wr.athena.read_sql_query(sql="SELECT * FROM rumorz_db.ohlcv_1d_partition where coin='ADA'",
database="rumorz_db",
boto3_session=s3.session)
print(time.time() - start)
start = time.time()
current_data = parquet_util.readParquet(
file_path=s3.getObjectURI(RUMORZ_DATA_S3_BUCKET, f"market_data/1d_ohlcv_raw.parquet/"),
partition_filter=lambda x: x["coin"] == 'ADA'
)
print(time.time() - start)
The AWS Athena method takes 6 seconds, while read_parquet takes 2 seconds. I thought Athena was faster than reading parquet directly, not 3 times slower. Is this expected?