1

I wanted to ask the AWS community a question.

I recently shifted to Athena, and have the following observation:

It takes much more time to query data using pyathena (python client) than doing it straight in athena. I have a database of customer data, and when I execute a query in athena, it takes less than 60 secs to get the data, but when I execute the same query in Pyathena, it takes about 40 mins to do the same job.

Here is my python query:

cnxn = connect(s3_staging_dir='URL Address for my Athena results',region_name='us-east-2')
        sql= ''' SELECT * from some query '''

        df= pd.read_sql(sql, cnxn)

Can someone help me understand why this happens? Am i doing anything wrong?

Thank you

----EDITED----

I am running the query in Sagemaker. I am executing the query in Sagemaker's Jupyter env.

Kshitij Yadav
  • 1,357
  • 1
  • 15
  • 35
  • How many rows are being returned? From where is the query being run (eg EC2 instance or somewhere on the Internet)? When you "execute a query in Athena", are you doing it in the console (where it just displays a few rows) or doing it via an API call to Athena (which returns all the data)? I'm trying to see whether the delay is caused by transferring results rather than actually using Athena. – John Rotenstein Nov 13 '19 at 22:57
  • Try to add `limit 10` to your query – shuvalov Nov 14 '19 at 04:59
  • @JohnRotenstein I am running the query in Sagemaker. I am executing the query in Sagemaker's Jupyter env. – Kshitij Yadav Nov 14 '19 at 14:54
  • If you temporarily change the query to only return a few rows (eg `LIMIT 10`), does that speed things up? If so, then the delay appears to be related to the amount of data being returned and how the notebook loads it into memory. – John Rotenstein Nov 14 '19 at 19:56

1 Answers1

0

pd.read_sql will fetch the data piece by piece once a new one gets available which is super inefficient. Instead, you want to fully download the file first and then upload it into Python session. pyathena package implemented a special type of cursor for that: https://pypi.org/project/PyAthena/#pandascursor

Kreol
  • 207
  • 2
  • 7