
I have a model for the four possible outcomes when a customer considers a pair of items (buying both, neither, or just one) and need to optimize its (pseudo-)log-likelihood function. Part of this, of course, is calculating/defining the pseudo-log-likelihood function.

The following is my code, where Beta is a 2-d vector for each customer (there are U customers and hence U beta vectors), X is a 2-d vector for each of the N items (different for each item), and Gamma is a symmetric matrix with a scalar value gamma(i,j) for each pair of items. df is a dataframe of the purchases, with one row per customer and N columns for the items.
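For concreteness, here is a minimal sketch of the shapes as I use them (the sizes, the random fill, and the id column names are placeholders; x_vals has an id column followed by the two feature columns V1 and V2):

import numpy as np
import pandas as pd

U, N = 3, 4                                    # placeholder sizes
Beta = np.random.randn(U, 2)                   # one 2-d beta per customer
Gamma = np.random.randn(N, N)
Gamma = (Gamma + Gamma.T) / 2                  # symmetric: gamma(i,j) == gamma(j,i)
x_vals = pd.DataFrame({'item': range(N),       # placeholder id column
                       'V1': np.random.randn(N),
                       'V2': np.random.randn(N)})
df = pd.DataFrame(np.hstack([np.arange(U)[:, None],              # id column
                             np.random.randint(0, 2, (U, N))]))  # 0/1 purchases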

It seems to me that all of these loops are inefficient and take up too much time, but I am not sure how to speed this calculation up and would appreciate any help improving it. Thank you in advance!

import datetime

import numpy as np

def pseudo_likelihood(Args):
    # U, N, df and x_vals are globals; Args packs Beta (2*U values) followed by Gamma (N*N values)
    Beta = np.reshape(Args[0:2*U], (U, 2))
    Gamma = np.reshape(Args[2*U:], (N, N))
    L = 0
    for u in range(U):
        print(datetime.datetime.today(), " for user {}".format(u))
        y = df.loc[u][1:].to_numpy()   # 0/1 purchase indicators for customer u
        beta_u = Beta[u, :]
        for l in range(N):
            print(datetime.datetime.today(), " for item {}".format(l))
            for i in range(N - 1):
                if i == l:
                    continue
                for j in range(i + 1, N):
                    if y[i] == y[j]:
                        if y[i] == 1:
                            # both items bought: log of the exponent of this expression
                            L += np.dot(beta_u, (x_vals.iloc[i, 1:] + x_vals.iloc[j, 1:])) + Gamma[i, j]
                        else:
                            # neither item bought
                            L += np.log(
                                1 - np.exp(np.dot(beta_u, (x_vals.iloc[i, 1:] + x_vals.iloc[j, 1:])) + Gamma[i, j])
                                - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:])) * (1 - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])))
                                - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])) * (1 - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:]))))
                    else:
                        # exactly one of the two items bought
                        if y[i] == 1:
                            L += np.dot(beta_u, x_vals.iloc[i, 1:]) + np.log(1 - np.exp(np.dot(beta_u, x_vals.iloc[j, 1:])))
                        else:
                            L += np.dot(beta_u, x_vals.iloc[j, 1:]) + np.log(1 - np.exp(np.dot(beta_u, x_vals.iloc[i, 1:])))

            L -= (N - 2) * np.dot(beta_u, x_vals.iloc[l, 1:])
            for k in range(N):
                if k != l:
                    L -= np.dot(beta_u, x_vals.iloc[k, 1:])

    return -L

To add/clarify: I am using this calculation to optimize for the beta and gamma parameters that generated the data for this pseudo-likelihood function.
I am using scipy's optimize.minimize with the 'Powell' method.
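A minimal sketch of that call (the starting point is a placeholder; it should keep each beta.x negative so the log(1 - exp(...)) terms stay defined):

import numpy as np
from scipy.optimize import minimize

# flat parameter vector: the first 2*U entries are Beta, the rest are Gamma
x0 = np.concatenate([-0.1 * np.ones(2 * U), np.zeros(N * N)])  # placeholder start
res = minimize(pseudo_likelihood, x0, method='Powell')
Beta_hat = res.x[:2 * U].reshape(U, 2)
Gamma_hat = res.x[2 * U:].reshape(N, N)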

1 Answer


Update for whoever is interested: I found that numpy.einsum speeds up the calculations here by over 90%.

np.einsum performs matrix/vector operations using Einstein summation notation: repeated indices in the subscript string are summed over. Recall that for two matrices A and B, the ik element of the product AB is the sum over j of

a_ij * b_jk
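A quick illustration with toy arrays (not from my model):

import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
v = np.array([1.0, 2.0, 3.0])

AB = np.einsum('ij,jk->ik', A, B)  # matrix product: sums over the repeated index j
Av = np.einsum('ij,j', A, v)       # the pattern used below: one dot product per row
assert np.allclose(AB, A @ B)
assert np.allclose(Av, A @ v)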

Using the einsum function I could calculate in advance all of the values necessary for the iterative calculation, saving precious time and hundreds, if not thousands, of unnecessary calculations. I rewrote the code as follows:

import math

import numpy as np

def pseudo_likelihood(Args):
    Beta = np.reshape(Args[0:2*U], (U, 2))
    Gamma = np.reshape(Args[2*U:], (N, N))
    exp_gamma = np.exp(Gamma)
    L = 0
    for u in range(U):
        y = df.loc[u][1:].to_numpy()   # 0/1 purchase indicators for customer u
        beta_u = Beta[u, :]
        # precompute beta_u . x_i for every item in a single einsum call
        beta_dot_x = np.einsum('ij,j', x_vals[['V1', 'V2']].to_numpy(), beta_u)
        exp_beta_dot_x = np.exp(beta_dot_x)
        log_one_minus_exp = np.log(1 - exp_beta_dot_x)
        for l in range(N):
            for i in range(N - 1):
                if i == l:
                    continue
                for j in range(i + 1, N):
                    if y[i] == y[j]:
                        if y[i] == 1:
                            # both bought: log of the exponent of this expression
                            L += beta_dot_x[i] + beta_dot_x[j] + Gamma[i, j]
                        else:
                            # neither bought
                            L += math.log(
                                1 - exp_beta_dot_x[i] * exp_beta_dot_x[j] * exp_gamma[i, j]
                                - exp_beta_dot_x[i] * (1 - exp_beta_dot_x[j])
                                - exp_beta_dot_x[j] * (1 - exp_beta_dot_x[i]))
                    else:
                        # exactly one bought
                        if y[i] == 1:
                            L += beta_dot_x[i] + log_one_minus_exp[j]
                        else:
                            L += beta_dot_x[j] + log_one_minus_exp[i]

            L -= (N - 2) * beta_dot_x[l]
            # subtract beta_u . x_k over all k != l (replaces the k loop from the question)
            L -= beta_dot_x.sum() - beta_dot_x[l]

    return -L
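One could go a step further (an untested sketch, not part of the code above): the pairwise quantities inside the i, j loops depend only on beta_dot_x and Gamma, so they can be precomputed as N x N matrices with broadcasting, leaving pure lookups inside the loops:

# inside the loop over u, after computing beta_dot_x, exp_beta_dot_x and exp_gamma
both = beta_dot_x[:, None] + beta_dot_x[None, :] + Gamma   # term for y[i] == y[j] == 1
neither = np.log(1
                 - np.outer(exp_beta_dot_x, exp_beta_dot_x) * exp_gamma
                 - exp_beta_dot_x[:, None] * (1 - exp_beta_dot_x)[None, :]
                 - exp_beta_dot_x[None, :] * (1 - exp_beta_dot_x)[:, None])  # y[i] == y[j] == 0
# the i, j loops then reduce to L += both[i, j] or L += neither[i, j]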
  • Since python is notoriously slow with for loops you might also benefit considerably if you can convert some of the loops to functions that can run inherently in machine code. I.e. take a look at the `numba` library and add the decorator `@jit(nopython=True)` – Attack68 Nov 16 '18 at 19:39
  • @Attack68 Thank you very much; unfortunately numba does not support pandas and doesn't work with my code – sa_zy Nov 19 '18 at 11:16
  • yes, I'm aware; the point I made (rather unclearly) was that instead of just placing the decorator at the top of your function and hoping it works (because it won't), try to sub-analyse your function and decompose it into blocks where sub-loops (only using NumPy) can be extracted and reformulated. Although, saying that, you only use a pandas dataframe in one line, and that can easily be converted into a NumPy array (see the sketch after these comments). – Attack68 Nov 19 '18 at 21:51
  • I see. Thank you, I did not understand that. I will see if I can isolate numpy blocks of code and add the decorator appropriately. Will update tomorrow – sa_zy Nov 21 '18 at 06:47
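A minimal sketch of the decomposition suggested in the comments (the helper name is hypothetical; it assumes y and the precomputed arrays are passed in as NumPy arrays, e.g. y = df.loc[u][1:].to_numpy()):

import numpy as np
from numba import jit

# hypothetical helper: the pure-NumPy pair loops extracted so numba can compile them
@jit(nopython=True)
def inner_sum(y, beta_dot_x, exp_beta_dot_x, log_one_minus_exp, Gamma, exp_gamma, l):
    N = beta_dot_x.shape[0]
    L = 0.0
    for i in range(N - 1):
        if i == l:
            continue
        for j in range(i + 1, N):
            if y[i] == y[j]:
                if y[i] == 1:
                    L += beta_dot_x[i] + beta_dot_x[j] + Gamma[i, j]
                else:
                    L += np.log(1 - exp_beta_dot_x[i] * exp_beta_dot_x[j] * exp_gamma[i, j]
                                - exp_beta_dot_x[i] * (1 - exp_beta_dot_x[j])
                                - exp_beta_dot_x[j] * (1 - exp_beta_dot_x[i]))
            elif y[i] == 1:
                L += beta_dot_x[i] + log_one_minus_exp[j]
            else:
                L += beta_dot_x[j] + log_one_minus_exp[i]
    return L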