3

Python allows me to get percentiles 0 through 100 in steps of 1 for a list of values, as follows:

import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9,10])
np.percentile(a,np.arange(0,101,1),interpolation='higher')

Result:

array([ 1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  8,
        8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  9,  9,  9,  9,  9,  9,  9,
        9,  9,  9,  9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

I am trying to achieve the same result with Athena SQL, but I have not yet been successful.
Although I have tried using NTILE, I am probably missing a key concept:

-- sample data
WITH dataset (val) AS (
    VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
)

select *,NTILE(101) OVER(ORDER BY val) calculated_ntile
    from dataset

Is there a way to get Athena SQL results that are exactly equivalent to Python results?

Update: I've written the following query:

WITH dataset (val) AS (
    VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
)

 SELECT approx_percentile(val, ARRAY[0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99] ) from dataset

That outputs almost what I need:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

What do I need to modify?

lemon
  • 14,875
  • 6
  • 18
  • 38
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186

2 Answers2

1

Since there where no clear definition of the different implementations of percentile calculations, I ended up writing a python equivalent to Athena's approx_percentile:

def approx_percentile(a,p):
  a = np.sort(a)
  n = a.shape[0]
  rank = (p / 100) * n
  return a[0 if math.ceil(rank) == 0 else math.ceil(rank)-1]
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186
  • Your question was about calculating percentiles using the interpolation='higher', extracting data with Athena... not about translating the `approx_percentile` to python. If you are happy with that method of percentiles, why didn't you stay on Athena? – nferreira78 Aug 24 '22 at 16:04
0

You are trying to rewrite a SQL statement to give you the result from a specific method in numpy.percentile.

You can refer to numpy.percentile docs, but rather than re-writing (translating python to sql) which could retrieve different results on different data, I would suggest you use the aws python sdk (boto3) to connect with Athena, collect the data and then process it with numpy.percentile:

import boto3

client = boto3.client('athena', region_name='us-west-2')

QUERY_STRING = '''WITH dataset (val) AS (
    VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
)
select *,NTILE(101) OVER(ORDER BY val) calculated_ntile
    from dataset'''

response = client.start_query_execution(
    QueryString=QUERY_STRING,
)

Then use the response to calculate your final results:

results = np.percentile(response, np.arange(0,101,1), interpolation='higher')

If you need, you can then update your Athena database using the python sdk again with your results.

nferreira78
  • 1,013
  • 4
  • 17
  • thank you for you reply, however, the data I'm trying to process is very large, that is why I want to do the calculations in Athena then get the results to the python side. – Shlomi Schwartz Aug 23 '22 at 06:00
  • numpy can also process a very large amount of data, as long as you have enough memory. re-writing the `numpy.percentile` with the method you want, it will require you a the same path of testing to guarantee the same results – nferreira78 Aug 23 '22 at 08:49
  • Here are the tests from `numpy.percentile`, which would need to be followed to guarantee repeatability: https://github.com/numpy/numpy/blob/86a368f53ba15a610be4c6cb3a173446aef3aa80/numpy/lib/tests/test_function_base.py However, this is quite an unnecessary effort imho. As you can see, it is just not worth it. As long as you run this code from a big enough EC2, this is the quickest and more reliable path. – nferreira78 Aug 23 '22 at 09:40