-1

I have a dataset in the form of a table:

Score   Percentile
 381         1
 382         2
 383         2
      ...
 569        98
 570        99

The complete table is here as a Google spreadsheet.

Currently, I am computing a score and then doing a lookup on this dataset (table) to find the corresponding percentile rank.

Is it possible to create a function to calculate the corresponding percentile rank for a given score using a formula instead of looking it up in the table?

Maximilian Peters
Michael C
    Welcome to SO. Please provide a Minimal, Complete, and Verifiable example. **Show us the code for your latest attempt** and where you got stuck, and **explain why the result is not what you expected**. https://stackoverflow.com/help/mcve – Dragonthoughts Aug 29 '18 at 13:55

2 Answers

1

It's impossible to recreate the function that generated a given table of data, if no information is provided about the process behind that data.

That being said, we can speculate.

Since it's a "percentile" function, it probably represents the cumulative distribution function (CDF) of some probability distribution. A very common one is the normal distribution, whose CDF (i.e. the integral of its density) is expressed in terms of the so-called "error function" ("erf").

In fact, your tabulated data looks a lot like an error function for a variable whose average value is 473.09:

[Figure: your dataset (orange) and the fitted error function, erf (blue)]

However, the agreement is not perfect, and that could be for any of three reasons:

  1. the fitting procedure I used to generate the parameters for the error function didn't use the right constraints (because I have no idea what I'm modelling!)
  2. your dataset doesn't represent an exact normal distribution, but rather real-world data whose underlying distribution is normal. In that case the fit simply ignores the features of your sample data that deviate from the model.
  3. the underlying distribution is not a normal distribution at all, and its integral just happens to look like the error function by chance.

There is literally no way for me to tell!

If you want to use this function, this is its definition:

import numpy as np
from scipy.special import erf
def fitted_erf(x):
    c = 473.09090474
    w =  37.04826334
    return 50+50*erf((x-c)/(w*np.sqrt(2)))

Tests:

In [2]: fitted_erf(439) # 17 from the table
Out[2]: 17.874052406601457

In [3]: fitted_erf(457) # 34 from the table
Out[3]: 33.20270318344252

In [4]: fitted_erf(474) # 51 from the table
Out[4]: 50.97883169390196

In [5]: fitted_erf(502) # 79 from the table
Out[5]: 78.23955071273468

However, I'd strongly advise you to check whether a fitted function, made without knowledge of your data source, is the right tool for your task.
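Since 50 + 50·erf((x − c)/(w·√2)) is just 100 times the CDF of a normal distribution with mean c and standard deviation w, the same model can be written with the standard library alone. The sketch below assumes Python 3.8+ (for `statistics.NormalDist`); the names `C`, `W`, `percentile_from_score` and `score_from_percentile` are made up for illustration and are not part of the original fit.

```python
from statistics import NormalDist

# Parameters copied from fitted_erf above; the names C and W are illustrative.
C = 473.09090474  # fitted centre (mean of the normal model)
W = 37.04826334   # fitted width (standard deviation of the normal model)

_fitted = NormalDist(mu=C, sigma=W)

def percentile_from_score(score):
    """Percentile (0-100) of `score` under the fitted normal model."""
    return 100.0 * _fitted.cdf(score)

def score_from_percentile(pct):
    """Inverse mapping; `pct` must lie strictly between 0 and 100."""
    return _fitted.inv_cdf(pct / 100.0)

print(percentile_from_score(439))  # same model as fitted_erf(439)
print(score_from_percentile(50))   # the fitted centre
```

The inverse is a convenience the lookup table does not give you directly, but it inherits every caveat of the fit.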


P.S.

In case you're interested, this is the code used to obtain the parameters:

import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

tab=np.genfromtxt('table.csv', delimiter=',', skip_header=1)
# using a 'table.csv' file generated by Google Spreadsheets
x = tab[:,0]
y = tab[:,1]

def parametric_erf(x, c, w):
    return 50+50*erf((x-c)/(w*np.sqrt(2)))

# curve_fit returns the best-fit parameters and their covariance matrix
pars, cov = curve_fit(parametric_erf, x, y, p0=[475,10])

print(pars)
# outputs [  473.09090474,   37.04826334]

And to generate the plot:

import matplotlib.pyplot as plt

plt.plot(x,parametric_erf(x,*pars))
plt.plot(x,y)
plt.show()
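As a numeric companion to the plot, the gap between table and fit can be summarised as residuals. This is only a sketch: it uses the handful of (score, percentile) rows quoted in this thread rather than the full spreadsheet, and `quoted_rows` and `residuals` are illustrative names. It relies on `math.erf` so that it runs without SciPy.

```python
from math import erf, sqrt

def fitted_erf(x, c=473.09090474, w=37.04826334):
    """Fitted model from this answer, using the standard-library erf."""
    return 50 + 50 * erf((x - c) / (w * sqrt(2)))

# (score, tabulated percentile) pairs quoted in the question and in the tests;
# the complete table would have many more rows.
quoted_rows = [(381, 1), (439, 17), (457, 34), (474, 51), (502, 79), (570, 99)]

# Residual = fitted value minus tabulated value, in percentile points.
residuals = [fitted_erf(score) - pct for score, pct in quoted_rows]

for (score, pct), res in zip(quoted_rows, residuals):
    print(f"score {score}: table {pct}, fit {fitted_erf(score):.2f}, residual {res:+.2f}")

print(f"largest absolute residual: {max(abs(r) for r in residuals):.2f}")
```

Running the same loop over every row of the real table would show where the fit drifts furthest from the tabulated percentiles.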
Nicola Sap
  • Thanks for this very detailed answer, it is exactly what I am looking for. After posting this question, I started digging around and realized that I needed to fit a model over this data but I couldn't figure out which kind (parabolic regression worked for the curves by themselves but not the data as a whole). As for the data itself, I should have given more details. The data is the national percentile rank for a standardized test (a cdf I think). I have some local results and I want to see in which percentile they fall nationally. – Michael C Aug 29 '18 at 17:33
  • Since your dataset is based on real world data, it can't be described with an analytical formula (other than a straightforward interpolation). However this kind of data (results of a test with a large number of possible scores) is probably described well by a normal distribution, and since your sampling is based on a whole country, the statistical data is expected to follow the underlying "ideal" distribution pretty well. – Nicola Sap Aug 29 '18 at 18:22
  • The decision is yours: you can use my fit, but be aware that it isn't the same thing as using the lookup table! Instead of saying "you are in the xth percentile compared to the rest of the country" you are saying "you are in the xth percentile of a normal distribution based on the results of the rest of the country" – Nicola Sap Aug 29 '18 at 18:22
  • Yes, thanks for the clarification. For production, I'm still using the tables but for my own personal learning and experimentation, the fitted model. Thank you for including the parameter selection in your answer. – Michael C Aug 30 '18 at 19:04
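The "straightforward interpolation" mentioned in the comments could look roughly like the sketch below. It is a sketch under assumptions: `scores` and `percentiles` hold only the rows quoted in this thread (the real table has far more), the function name is made up, and inputs outside the tabulated range are clamped to the end values, which is a design choice rather than anything the data dictates.

```python
from bisect import bisect_right

# Excerpt only: rows quoted in this thread, sorted by score.
scores = [381, 439, 457, 474, 502, 570]
percentiles = [1, 17, 34, 51, 79, 99]

def interpolated_percentile(score):
    """Linearly interpolate the percentile between neighbouring table rows."""
    if score <= scores[0]:
        return float(percentiles[0])   # clamp below the table
    if score >= scores[-1]:
        return float(percentiles[-1])  # clamp above the table
    i = bisect_right(scores, score)    # scores[i - 1] <= score < scores[i]
    x0, x1 = scores[i - 1], scores[i]
    y0, y1 = percentiles[i - 1], percentiles[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

print(interpolated_percentile(448))  # halfway between the rows for 439 and 457
```

With the complete table loaded, integer scores hit a row exactly and interpolation only matters for scores that fall between rows.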
0

Your question is quite vague, but it seems that whatever calculation you do ends up with a number in the range 381-570. Is this correct? You have a multiline calculation which gives this number? I'm guessing you are repeating this in many places in your code, which is why you want to turn it into a procedure?

For any calculation you can wrap it in a function. For instance:

answer = variable_1 * variable_2 + variable_3

can be written as:

def calculate(v1, v2, v3):
    ''' calculate the result from the inputs
    '''
    return v1 * v2 + v3

answer = calculate(variable_1, variable_2, variable_3)

If you would like a definitive answer, simply post your calculation and I can make it into a function for you.

Stephen Ellwood
  • Thanks Stephen. I have the dataset (the lookup table) but I don't have the formula behind it. So I'm trying to find the formula using the data in order to put the formula into a function. This is probably more of a math question than a programming one. I just found this [question](https://math.stackexchange.com/questions/11502/find-formula-from-values) which seems to be relevant. – Michael C Aug 29 '18 at 14:21
  • Are you looking for the formula to look up the answer in the dataset? If so, simply put the dataset into a dictionary and pull out the result directly. – Stephen Ellwood Aug 29 '18 at 15:30
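The dictionary approach suggested in the last comment might be sketched as follows. The dictionary holds only the rows quoted in this thread, and `percentile_by_score` and `lookup_percentile` are hypothetical names; in practice the dictionary would be built from the full spreadsheet.

```python
# Excerpt only: rows quoted in the question and in the first answer.
percentile_by_score = {
    381: 1, 382: 2, 383: 2,
    439: 17, 457: 34, 474: 51, 502: 79,
    569: 98, 570: 99,
}

def lookup_percentile(score):
    """Return the tabulated percentile, or None if the score is not in the table."""
    return percentile_by_score.get(score)

print(lookup_percentile(382))
print(lookup_percentile(1000))  # not tabulated
```

Returning `None` for a missing score is one option; raising a `KeyError` instead would make out-of-range inputs harder to overlook.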