I am building an index of user qualities, defined as a sum of (often) correlated continuous variables representing user activity. The index is well calibrated and serves the purpose of my analysis, but it is tricky to communicate to my co-workers, particularly since outlier activities cause extremely tenacious users to score very highly on the activity index.

For 97% of users, the index is distributed near-normally between 0 and 100, with a right tail of 3% of hyper-active users with an index > 100. Index-values beyond 200 should be extremely rare but are theoretically possible.

I'm looking to scale the tail back into a 0-100 span, but not linearly, since I would like the 3% tail to be represented as small variations within the top range of the 0-100 index. What I'm looking for is a non-linear formula to scale my index, like this:

enter image description here

The lower tier of the unscaled index should stay close to the scaled one, while high index values diverge, and the scaled value should never reach 100 as my index goes towards infinity. That is, f(0) = 0, and when x = 140, f(x) ≈ 99 or something similar.

I'll implement the scaling in R, Python and BigQuery.

nJGL

1 Answer

There are lots of ways to do this: take any function with the right shape and tweak it to your needs.

One family of functions with the right shape is

f(x) = x/pow(1 + pow(x/100, n), 1/n)

You can vary the parameter n to adjust the shape: increasing n pushes f(100) closer to 100, so the curve follows the unscaled index for longer before bending. With n=5 you get something that looks pretty close to your drawing:

f(x) = x/pow(1 + pow(x/100, 5), 0.2)

enter image description here
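
As a minimal sketch of this first family in Python (one of the languages mentioned in the question), assuming NumPy is available. The function name `soft_cap` and the `cap` parameter are illustrative names, not part of the formula above; `cap=100` reproduces it.

```python
import numpy as np

def soft_cap(x, n=5.0, cap=100.0):
    """Soft cap f(x) = x / (1 + (x/cap)**n)**(1/n), for x >= 0.

    `soft_cap` and `cap` are hypothetical names chosen for this sketch;
    with cap=100 this is the formula given in the answer.
    """
    x = np.asarray(x, dtype=float)
    return x / (1.0 + (x / cap) ** n) ** (1.0 / n)

# Evaluate the scaled index for a few raw index values.
raw = np.array([0, 50, 100, 140, 200, 1000])
scaled = soft_cap(raw)
```

With n=5, `soft_cap(140)` comes out at roughly 96.6, so a larger n is needed to get closer to the 99 asked for in the question. As n grows, the curve approaches min(x, 100).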

Another option is the hyperbolic tangent function, tanh, which you can tweak in similar ways:

f(x) = 100*pow(tanh(pow(x/100, n)), 1/n)

Here's the curve with n=2:

enter image description here
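
The tanh variant can be sketched the same way, again assuming NumPy. The name `tanh_cap` and the `cap` parameter are illustrative only; `cap=100` matches the formula above.

```python
import numpy as np

def tanh_cap(x, n=2.0, cap=100.0):
    """Soft cap f(x) = cap * tanh((x/cap)**n)**(1/n), for x >= 0.

    `tanh_cap` and `cap` are hypothetical names chosen for this sketch;
    with cap=100 this is the tanh formula given in the answer.
    """
    x = np.asarray(x, dtype=float)
    return cap * np.tanh((x / cap) ** n) ** (1.0 / n)

# Evaluate the scaled index for a few raw index values.
raw = np.array([0, 10, 100, 140, 200])
scaled = tanh_cap(raw)
```

With n=2, `tanh_cap(140)` is roughly 98. One caveat: mathematically the curve never reaches 100, but in double-precision arithmetic tanh rounds to exactly 1.0 for large arguments, so very large raw values can come out as exactly 100.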

Joni
  • Splendid. Extraordinarily pedagogical answer! It is easy enough to tweak the exponents to alter the intensity of the cut-off. I want most of the reduction to happen after x > 90, so this will absolutely do the job. `TANH()` is perhaps not BigQuery's most used function, but it exists. – nJGL Aug 31 '20 at 08:53