
I want to normalize a function (for example, chi2.pdf from scipy) over a range A to B. chi2.pdf, for instance, is normalized over its full range of 0 to infinity, and its integral over that range is 1. To normalize over A to B, I can calculate the integral of the function from A to B and divide the function by that integral. I can implement this with the following code:

import numpy as np
from scipy.stats import chi2
from scipy.integrate import quad

A = 2
B = 4
df = 3

z = quad(chi2.pdf,A,B,args=(df,A))[0]

quad passes df through to chi2.pdf as the degrees of freedom and A as loc - I want my chi-square function shifted by A for various reasons. Now that I have z, I can define a new function:

def normalized_chi_2(x,df,A,z):
    y = chi2.pdf(x,df,A)/z
    return(y)

A quick check with integration again:

integral_chi2 = quad(normalized_chi_2,A,B,args=(df,A,z))[0]
print(integral_chi2)
>0.9999999999999999

This shows that I achieved my purpose. But having two functions and computing z in the main script is relatively unwieldy, so I figured I could define a single function that calculates z internally.

def normalized_chi_1(x,df,A):
    z = quad(chi2.pdf,A,B,args=(df,A))[0]
    y = chi2.pdf(x,df,A) / z
    return(y)

Now when I do a quick integration again:

integral_chi1 = quad(normalized_chi_1,A,B,args=(df,A))[0]
print(integral_chi1)
>0.42759329552910064

I don't get 1; I get a value equal to the integral of the original, unnormalized chi2.pdf (the z above). Another problem is that normalized_chi_1 (which takes df and A and calculates its own z) is very slow. The version where I calculate z outside the function and pass it in takes ~0.07 seconds, while the version where I calculate z inside the function takes ~7.30 seconds - about a hundred times slower.
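For reference, the numbers above can be measured with a rough timer around the two calls (a sketch only, not my exact timing script):

import time

# Rough comparison of the two approaches (illustrative only).
start = time.perf_counter()
quad(normalized_chi_2,A,B,args=(df,A,z))
print("z passed in from outside:", time.perf_counter() - start)

start = time.perf_counter()
quad(normalized_chi_1,A,B,args=(df,A))
print("z recomputed inside:", time.perf_counter() - start)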

  • Instead of integrating `chi2.pdf`, you could use `chi2.cdf(B) - chi2.cdf(A)`. (The CDF is the integral of the PDF, after all; see the sketch after these comments.) – Warren Weckesser Oct 17 '21 at 04:41
  • @WarrenWeckesser That's brilliant! Also, since my chi2.pdf is centered around 2, I don't need to calculate chi2.cdf of A; I just need chi2.cdf of B. I've tested it, and I get roughly 10% faster by just calling chi2.cdf(B,df,A) than computing the integral through quad. Thanks so much! – ivan199415 Oct 17 '21 at 09:16
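A minimal sketch of that CDF shortcut, reusing the A, B, and df from the question (the function name normalized_chi_cdf is just illustrative):

from scipy.stats import chi2

A = 2
B = 4
df = 3

# chi2.cdf(x,df,A) is the integral of chi2.pdf(.,df,A) from A up to x,
# so the mass between A and B is just a difference of CDF values.
z = chi2.cdf(B,df,A) - chi2.cdf(A,df,A)   # the second term is 0 because loc=A

def normalized_chi_cdf(x,df,A,z):
    y = chi2.pdf(x,df,A)/z
    return(y)

Integrating this with quad(normalized_chi_cdf,A,B,args=(df,A,z))[0] should again give roughly 1, without any quad call for z.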

1 Answer


quad evaluates your integrand many times in a loop under the hood, and each of those calls runs another quad to recompute z, which gets quite taxing. To test this out, I added a simple print statement with a counter to the original function.

count = 0
def normalized_chi_1(x,df,A):
    global count
    z = quad(chi2.pdf,A,B,args=(df,A))[0]
    print(f"calculating z {count}th time")
    count += 1
    y = chi2.pdf(x,df,A) / z
    return(y)

The output I got was

calculating z 0th time
...
calculating z 227th time
calculating z 228th time
calculating z 229th time
calculating z 230th time

So you're computing the integral for z about 230 times, which more or less explains the 100x increase in execution time.

If you still want a function to calculate z, you can simply cache it:

from functools import lru_cache

@lru_cache
def get_z(func,a,b,args=()):
    # Cached: the integral is only computed once per unique
    # (func, a, b, args) combination, no matter how often quad
    # calls the integrand.
    return quad(func,a,b,args=args)[0]

A = 2
B = 4
df = 3

def normalized_chi_1(x,df,A):
    z = get_z(chi2.pdf,A,B,args=(df,A))
    y = chi2.pdf(x,df,A)/z
    return(y)

integral_chi1 = quad(normalized_chi_1,A,B,args=(df,A))[0]

This gave me the correct result and a running time of about 0.07 s, but I think simply defining z in the main script is better.
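For completeness, one more way to compute z only once while keeping it out of the integrand's argument list is a small factory function (a sketch under the same setup as above; make_normalized_chi2 is an illustrative name, not part of the original code):

def make_normalized_chi2(df,A,B):
    # z is computed a single time, when the normalized function is built.
    z = quad(chi2.pdf,A,B,args=(df,A))[0]
    def normalized(x):
        return chi2.pdf(x,df,A)/z
    return normalized

normalized = make_normalized_chi2(df,A,B)
integral = quad(normalized,A,B)[0]   # should again be ~1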

  • I figured it was looping, but couldn't work out why it wasn't giving me the correct answer. Using memoization is pretty clever too. I will say that using scipy's predefined distribution has the merit of including a cdf, and I'll try to glue together your proposal and Warren's comment above. Thanks very much. – ivan199415 Oct 17 '21 at 09:13