0

The code below provides a complete reproducible example. My question is (mainly) on my function thetaMax(). This is a function which minimizes a log-likelihood of a psychometric process.

I am learning Python and doing so by translating my R functions to python. The code below works as expected. However, because I am learning Python, but my question is on style and quality.

The function thetaMax would run over hundreds of thousands of individuals and in my R code is efficient and is split among multiple cores. But, before thinking about parallel processing, my first goal is make the python code as fast and efficient as possible.

There are probably many parts of my function thetaMax that can be improved, but one aspect I am most concerned about is:

for i in range(0,len(x2)):
     result[i] = (gpcm(theta, d = d[i], score = x2[i], a = 1, D = D))

I think doing this as a loop is probably bad and could be improved with some form of vectorization. Below are the complete contents necessary to implement this code, and thanks to anyone willing to offer suggestions on how to improve the code.

import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize

def prob3pl(theta, a, b, c, D = 1.7):
    result = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return(result)


def gpcm(theta, d, score, a, D = 1.7):
    Da = D * a
    result = np.exp(np.sum(Da * (theta - d[0:score])))/np.sum(np.exp(np.cumsum(Da * (theta - d))))
    return(result)

d = np.array([[0, -1, .5, 1],[0,-.5,.2,1]])
a = np.array([1,1,1,1,1])
b = np.array([-1,.5,-.5,0,2])
c = np.array([0,0,0,0,0])
x = np.array([1,1,0,1,0,1,1])
indDichot = range(0,5,1)

def thetaMax(x, indDichot, a, b, c, D, d):
    x1 = x[indDichot]
    x2 = np.delete(x, indDichot)
    result = [0] * len(x2)
    def fn(theta):
        if(len(x1) > 0):
            p = prob3pl(theta, a, b, c, D = D)
            logDichPart = sum(np.log(binom.pmf(x1,1,p)))
        else:
            logPolyPart = 0
        if(len(x2) > 0):
            for i in range(0,len(x2)):
                result[i] = (gpcm(theta, d = d[i], score = x2[i], a = 1, D = D))
            logPolyPart = sum(np.log(result))
        else:
            logPolyPart = 0
        LL = -(logDichPart + logPolyPart)
        return(LL)
    out = minimize(fn, x0=0)
    return(out)


thetaMax(x,indDichot,a,b,c,D=1,d = d)
martineau
  • 119,623
  • 25
  • 170
  • 301
user350540
  • 429
  • 5
  • 17
  • I think that codereview stackexchange is a better fit for this question. A general performance tip is to use as many vectorized operations as possible. For example, `sum(np.log(...))` should be `np.log(...).sum()`. – hilberts_drinking_problem Apr 27 '20 at 23:43
  • 1
    Can you explain what the program is doing? Have you done any benchmarking? Can we run this ourselves? As an aside, variable and function names should generally follow the `lower_case_with_underscores` style. – AMC Apr 28 '20 at 00:46
  • @hilberts_drinking_problem. Both operations are equivalent and both vectorized – Mad Physicist Apr 28 '20 at 05:23
  • 1
    I’m voting to close this question because it belongs on https://codereview.staxkexchange.com – Mad Physicist Apr 28 '20 at 05:24
  • @MadPhysicist I see a 80x speedup from `np.arange(10**6).sum()` over `sum(np.arange(10**6))` (where `sum` is the Python's builtin sum). As far as I know, calling sum on `np.array` will not do any special dispatch to numpy, hence no vectorization. – hilberts_drinking_problem Apr 28 '20 at 05:29
  • 1
    @hilberts_drinking_problem. Sorry, I misread `sum` as `np.sum`, which OP uses correctly sometimes – Mad Physicist Apr 28 '20 at 05:37
  • @MadPhysicist makes sense, I mentioned this in a comment since it could come as a surprise for someone with R background. – hilberts_drinking_problem Apr 28 '20 at 05:41
  • Thank you all. I will move this to the appropriate forum as requested (wasn't aware of the codereview stackexchange). – user350540 Apr 28 '20 at 10:36
  • @AMC this program minimizes a likelihood function we use when scoring a student's overall test. When student's take a test, we have statistical estimates of the item difficulty for each test item. These are treated as fixed values and are the parameters 'a', 'b', 'c', 'd' in my function. Then we have how the student responded (correct/incorrect) denoted as a vector, in this case, 'x'. Given those two inputs, we then maximize the likelihood (or minimize in this case) to estimate the student's ability on a scale. This is the value returned called 'theta' – user350540 Apr 28 '20 at 10:39

0 Answers0