
I have set up separate Athena workgroups for different teams so that I can keep their queries and query results apart. The users would like to query the tables available to them from their notebook instances (JupyterLab). I am having difficulty finding code that queries a table from the user's specific workgroup; I have only found code that queries against the primary workgroup.

The code I am currently using is below.

from pyathena import connect
import pandas as pd

conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>'
)

df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> LIMIT 8;", conn)
df

This code does not work: the users only have permission to run queries in their specific workgroups, so they get errors when it runs. It also fails to meet the requirement of keeping each user's queries in a user-specific workgroup.

Any suggestions on how I can alter the code so that the queries run within a specific workgroup from the notebook instance?

Nicholas
  • I am not familiar with `pyathena` but after a quick look I found parameter `work_group` which can be passed to `connect`. Have you tried to use it? – Ilya Kisil Aug 14 '19 at 11:05
  • @IlyaKisil No I haven't tried it.. I couldn't find anything online documentation wise. Would you be able to link me to where you found this? Cheers – Nicholas Aug 16 '19 at 13:41

2 Answers


The pyathena documentation is not very extensive, but a look at the source code shows that connect simply creates an instance of the Connection class.

def connect(*args, **kwargs):
    from pyathena.connection import Connection
    return Connection(*args, **kwargs)

Now, looking at the signature of Connection.__init__ on GitHub, we can see the parameter work_group=None, which is named the same way as one of the parameters of start_query_execution from the official AWS Python API, boto3. Here is what that documentation says about it:

WorkGroup (string) -- The name of the workgroup in which the query is being started.
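To make the mapping concrete, this is roughly the request that ends up being sent to boto3's start_query_execution. The workgroup and bucket names below are purely illustrative, not anything from your setup:

```python
# The 'WorkGroup' key is the same value that pyathena's
# connect(work_group=...) forwards to Athena. 'team-a' and the
# result bucket are made-up names.
params = {
    'QueryString': 'SELECT 1',
    'WorkGroup': 'team-a',
    # If the workgroup is configured to enforce its own output location,
    # this client-side setting is overridden and can be omitted:
    'ResultConfiguration': {'OutputLocation': 's3://team-a-query-results/'},
}

# With credentials in place, this request would be submitted as:
# boto3.client('athena', region_name='us-west-2').start_query_execution(**params)
print(params['WorkGroup'])
```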

Following the usages and imports in Connection, we end up at the BaseCursor class, which under the hood calls start_query_execution, unpacking a dictionary of parameters assembled by the BaseCursor._build_start_query_execution_request method. That is exactly where we see the familiar syntax for submitting queries to AWS Athena, in particular the following part:

if self._work_group or work_group:
    request.update({
        'WorkGroup': work_group if work_group else self._work_group
    })
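In other words, a work_group passed per query takes precedence over the one set on the connection. A minimal standalone sketch of that precedence logic (my own re-implementation for illustration, not pyathena's actual code):

```python
def build_request(connection_work_group=None, work_group=None):
    # Mimics the snippet above: a per-query work_group wins over the
    # connection-level one; if neither is set, the key is omitted and
    # Athena falls back to the 'primary' workgroup.
    request = {}
    if connection_work_group or work_group:
        request.update({
            'WorkGroup': work_group if work_group else connection_work_group
        })
    return request

print(build_request(connection_work_group='team-a'))
# {'WorkGroup': 'team-a'}
print(build_request(connection_work_group='team-a', work_group='team-b'))
# {'WorkGroup': 'team-b'}
print(build_request())
# {}
```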

So this should do the trick in your case:

import pandas as pd
from pyathena import connect


conn = connect(
    s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
    region_name='<YOUR REGION, for example, us-west-2>',
    work_group='<USER SPECIFIC WORKGROUP>'
)

df = pd.read_sql("SELECT * FROM <DATABASE-NAME>.<YOUR TABLE NAME> limit 8;", conn)
Ilya Kisil

I implemented this and it worked for me.

!pip install pyathena


from pyathena import connect
from pyathena.pandas.util import as_pandas
import boto3


query = """
    SELECT * FROM "s3-prod-db"."CustomerTransaction" ct
    WHERE date(partitiondate) >= date('2022-09-30') LIMIT 10
"""

cursor = connect(
    s3_staging_dir='s3://s3-temp-prod2/',
    region_name=boto3.session.Session().region_name,
    work_group='data-scientist'
).cursor()

# execute() returns the cursor itself, so there is no need to assign it
cursor.execute(query)
print(cursor.state)
print(cursor.state_change_reason)
print(cursor.completion_date_time)
print(cursor.submission_date_time)
print(cursor.data_scanned_in_bytes)
print(cursor.output_location)

df = as_pandas(cursor)
print(df)

If we don't pass the work_group parameter, pyathena uses "primary" as the default workgroup.

If the s3_staging_dir bucket (here s3://s3-temp-prod2/) does not exist, pyathena will create it.

But if the role running the script does not have the privilege to create buckets, it will throw an exception.
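If you would rather fail fast than rely on that implicit bucket creation, you can check the staging bucket up front. A sketch under assumptions: bucket_exists is my own helper, not part of pyathena, and it expects a boto3 S3 client:

```python
def bucket_exists(s3_client, bucket):
    # head_bucket raises botocore's ClientError (an Exception subclass)
    # when the bucket is missing or we lack permission to see it.
    try:
        s3_client.head_bucket(Bucket=bucket)
        return True
    except Exception:
        return False

# Usage with a real client (requires AWS credentials):
# import boto3
# if not bucket_exists(boto3.client('s3'), 's3-temp-prod2'):
#     raise RuntimeError('Create the staging bucket first, '
#                        'or grant the role s3:CreateBucket')
```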

Gautam