
I am building a data lake in AWS. The source data arrives in S3 as CDC (change data capture) files. I need a way to merge them so that I end up with a table holding the most recent version of each record.
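(The merge step itself should be simple once a CDC batch is in pandas; the sketch below is roughly what I have in mind, with the key column id and the change timestamp updated_at standing in for whatever the real schema uses.)

#rough sketch of the CDC merge, with placeholder column names
import pandas as pd

def latest_snapshot(cdc_frames):
    #stack all CDC batches, then keep only the newest row per key
    combined = pd.concat(cdc_frames, ignore_index=True)
    combined = combined.sort_values('updated_at')
    return combined.drop_duplicates(subset='id', keep='last')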

Originally I wanted to use Glue for the ETL development, but the editor seems rather clunky. Additionally, the data volume is not large enough to require Spark; pandas will work too and has a much broader knowledge base in the organisation.

So I used Glue to crawl the imports, and I now have Athena tables. I want to develop my aggregations against them in Cloud9 and later migrate the code to a Lambda function.

The problem is that I can't get the Athena data into a dataframe.

I have tried the start_query_execution function from boto3, but it does not return the data; it only writes it to S3, which I don't want. It does return a QueryExecutionId, which I have passed into another boto3 function, get_query_results. There seems to be a response, but I struggle with how to get the data into a dataframe (is it JSON or a dict?).
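(For reference: get_query_results returns a plain Python dict, not a JSON string. The rows sit under ResultSet -> Rows, each cell carries a VarCharValue, and for a SELECT the first row holds the column names. A minimal sketch of flattening such a response into a dataframe follows; the helper name results_to_df is just a placeholder.)

import pandas as pd

def results_to_df(results):
    #results is the dict returned by athena.get_query_results
    rows = results["ResultSet"]["Rows"]
    #for a SELECT, the first row holds the column names
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    records = [[col.get("VarCharValue") for col in row["Data"]] for row in rows[1:]]
    #note: get_query_results returns at most 1000 rows per call
    return pd.DataFrame(records, columns=header)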

#python 3.6
import pandas as pd
import numpy as np
import boto3
import time

#https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/

#athena constant
DATABASE = 'myDatabase'
TABLE = 'myTable'

#output
S3_OUTPUT = 's3://myBucket/myPath/'

client = boto3.client('athena')

response = client.start_query_execution(
        QueryString=f'select * from {TABLE} limit 100',
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': S3_OUTPUT
        }
)

print(response["QueryExecutionId"])

#crude fixed wait for the query to finish
time.sleep(50)

data = client.get_query_results(
    QueryExecutionId=response["QueryExecutionId"]
)

#this is where I am stuck: read_json does not accept the ResultSet
dataDf = pd.read_json(data["ResultSet"])
print(dataDf.head())
Daniel

1 Answer


This worked for me: download the result file that Athena writes to S3 instead of parsing the API response.

import os
import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket = 'myBucket'
key = 'myPath'

#Athena stores the result set as <QueryExecutionId>.csv under the output location
data_file_name = f'{response["QueryExecutionId"]}.csv'
object_key = os.path.join(key, data_file_name)
s3.download_file(bucket, object_key, data_file_name)
df = pd.read_csv(data_file_name)
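A caveat: both the fixed time.sleep(50) in the question and the download above assume the query has already finished. A minimal polling sketch using get_query_execution (the state names are those Athena reports), as an alternative to guessing a sleep duration:

import time
import boto3

athena = boto3.client('athena')

def wait_for_query(query_execution_id, poll_seconds=2):
    #poll Athena until the query leaves the QUEUED/RUNNING states
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = status['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

Calling wait_for_query(response['QueryExecutionId']) before s3.download_file removes the guesswork; anything other than 'SUCCEEDED' should be treated as an error.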
ulmefors