
I am extracting data from AWS Athena using the pyathena library and the following function:

import pandas as pd
import pyathena

def import_ben_datalake(ACCESS_KEY, SECRET_KEY, S3_DIR, REGION, start, end):
    conn = pyathena.connect(aws_access_key_id = ACCESS_KEY, 
                            aws_secret_access_key = SECRET_KEY,
                            s3_staging_dir = S3_DIR,
                            region_name = REGION)
    sql = f"""SELECT columns
          FROM table
          WHERE column_datetime BETWEEN PARSE_DATETIME('{start.strftime("%Y-%m-%d")}', 'YYYY-MM-DD')
                                    AND PARSE_DATETIME('{end.strftime("%Y-%m-%d")}', 'YYYY-MM-DD')"""

    df = pd.read_sql(sql, conn)
    conn.close()

    return df

The start and end params are datetime.date values, e.g.:

start_test = datetime.date(2020, 11, 22)
end_test = datetime.date(2020, 11, 28)

Both dates are in November of this year, but when I call the function it returns all the rows between 2020-Jan-22 and 2020-Jan-28.
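For reference, this is the SQL text the f-string actually produces with those inputs (just the string building, no Athena call):

```python
import datetime

start = datetime.date(2020, 11, 22)
end = datetime.date(2020, 11, 28)

# Same f-string as in the function above: the dates are rendered
# directly into the query text as string literals.
sql = f"""SELECT columns
      FROM table
      WHERE column_datetime BETWEEN PARSE_DATETIME('{start.strftime("%Y-%m-%d")}', 'YYYY-MM-DD')
                                AND PARSE_DATETIME('{end.strftime("%Y-%m-%d")}', 'YYYY-MM-DD')"""
print(sql)
```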

Any help would be great to solve this issue!

Comments:
    Ugh, please don't be using string formatting to build your SQL queries. Not only is it likely the cause of your problem, but it's also open to SQL Injection. Look into parameterized queries – roganjosh Dec 28 '20 at 18:15
  • Dear @roganjosh, thanks for your advice. I have no idea how to use parameterized queries! Could you give me an example of how my query would look as a parameterized query? Best regards!! – danimille Dec 28 '20 at 18:21
  • Hmm. I've just found out that Amazon Athena [doesn't support parameterized queries](https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html) so it seems they won't help in this case. That seems very peculiar to me – roganjosh Dec 28 '20 at 18:43
  • In fact, it does support parameterized queries. I'll update my post with the correct parameterized query. – danimille Dec 28 '20 at 19:04

1 Answer


Here is the parameterized query that solved my issue:

import pandas as pd
import pyathena

def import_ben_datalake(ACCESS_KEY, SECRET_KEY, S3_DIR, REGION, start, end):
    conn = pyathena.connect(aws_access_key_id=ACCESS_KEY,
                            aws_secret_access_key=SECRET_KEY,
                            s3_staging_dir=S3_DIR,
                            region_name=REGION)
    # PyAthena follows the DB-API "pyformat" paramstyle, so the driver
    # substitutes %(start)s and %(end)s from the params dict itself
    # instead of the query being built by string formatting.
    sql = """SELECT columns
             FROM table
             WHERE column_datetime BETWEEN %(start)s AND %(end)s"""

    df = pd.read_sql(sql, conn, params={"start": start, "end": end})
    conn.close()

    return df
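Note that the `%(name)s` placeholder style is PyAthena's DB-API paramstyle (pyformat); other drivers expect different placeholder syntax. As a self-contained sketch of the same pattern using Python's built-in sqlite3 (which uses the `:name` style), with a hypothetical `events` table standing in for the Athena table:

```python
import datetime
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for Athena; the table and
# column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2020-01-25",), ("2020-11-23",), ("2020-11-27",), ("2020-12-01",)],
)

# sqlite3 uses the :name paramstyle instead of %(name)s.
sql = """SELECT event_date
         FROM events
         WHERE event_date BETWEEN :start AND :end"""

# The driver, not string formatting, substitutes the values.
df = pd.read_sql(
    sql,
    conn,
    params={"start": datetime.date(2020, 11, 22).isoformat(),
            "end": datetime.date(2020, 11, 28).isoformat()},
)
conn.close()
print(df["event_date"].tolist())  # only the November rows
```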
  • Nice find. It's interesting that the python binding gets around the limitations of the actual lack of parameterization [here](https://github.com/laughingman7743/PyAthena/blob/b52c11fc3f6d98e0a9f303491bd22d07dec5f9c6/pyathena/formatter.py#L62) so that it complies with [PEP 249](https://www.python.org/dev/peps/pep-0249/) – roganjosh Dec 28 '20 at 20:12