
I have a function written in Python that performs two steps:

  1. Preprocessing: read in data from an array and compute some values I will need later, to avoid repeated computation.
  2. Iteration: compute a 'summary' of the data at every stage and use it to solve an optimisation problem.

The code is as follows:

import numpy as np

def iterative_hessian(data, targets, sketch_method, sketch_size, num_iters):
    '''
    Original problem is min 0.5*||Ax - b||_2^2.
    iterative_hessian instead minimises 0.5*||S_A x||_2^2 - <A^T b, x>
    for a summary S_A of the data.
    '''
    A = data
    y = targets
    n, d = A.shape
    x0 = np.zeros(shape=(d,))
    m = int(sketch_size)  # sketching dimension

    ATy = A.T @ y
    covariance_mat = A.T @ A

    for n_iter in range(int(num_iters)):
        S_A = m**(-0.5) * np.random.normal(size=(m, n))
        B = S_A.T @ S_A
        z = ATy - covariance_mat @ x0 + S_A.T @ (S_A @ x0)
        x_new = np.linalg.solve(B, z)
        x0 = x_new
    return np.ravel(x0)

In practice I do not use the S_A = m**(-0.5)*np.random.normal(size=(m, n)) line but a different random transform that is faster to apply; in principle, though, it is sufficient for the question. This code works well for what I need, but I was wondering whether there is a reasonable way to do the following:

  1. Instead of repeating the line S_A = m**(-0.5)*np.random.normal(size=(m, n)) on every iteration, is there a way to specify the number of independent random copies of S_A that are needed (num_iters, which can be thought of as between 10 and 30) prior to the iteration, and scan through the input only once to generate them all? I think this would store the S_A matrices in some kind of multi-dimensional array, but I'm not sure how best to do this, or whether it is even practical. I have tried a basic example doing this in parallel, but it was slower than repeatedly passing over the matrix.
  2. Suppose I want to endow this function with more properties; for instance, I want to return the average time taken by the line x_new = np.linalg.solve(B, z). Doing this is straightforward: import a timing module and add the code to the function. However, that will always time the call, and perhaps I only want to do so when testing. An easy way around this is to add a parameter time_updates=False to the function definition, then if not time_updates: proceed as above, else: copy the exact same code with some timing functionality added. Is there a better way to do this, perhaps using classes in Python?
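To make question 1 concrete, here is a rough sketch of what I have in mind: drawing all num_iters Gaussian sketches up front into a 3-D array and indexing into it inside the loop (the sizes here are illustrative, not my real ones):

```python
import numpy as np

# Illustrative sizes only: num_iters copies of an (m, n) sketch.
num_iters, m, n = 5, 10, 100

# One call generates every independent copy; sketches[i] plays the
# role of S_A on iteration i.
sketches = m**(-0.5) * np.random.normal(size=(num_iters, m, n))

for n_iter in range(num_iters):
    S_A = sketches[n_iter]  # independent copy for this iteration
```

I am unsure whether materialising all copies like this is actually cheaper than drawing them one at a time, which is part of what I am asking.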

My intention is to use this iteration on blocks of data read in from a file that doesn't fit into memory. Whilst it might be possible to store a block in memory, it would be convenient if the function passed over that block only once rather than num_iters times. Passing repeatedly over the computed quantities (S_A, covariance_mat, etc.) is fine, however.
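For question 2, the kind of alternative I imagine to the time_updates flag is a decorator that records call durations as a side effect, applied only in test code (this is a sketch; timed and slow_add are names I made up for illustration):

```python
import time
from functools import wraps

def timed(fn):
    '''Wrap fn so each call's wall-clock duration is appended to
    the wrapper's .times list; the return value is unchanged.'''
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.times.append(time.perf_counter() - start)
        return result
    wrapper.times = []
    return wrapper

@timed
def slow_add(a, b):
    time.sleep(0.01)  # stand-in for np.linalg.solve(B, z)
    return a + b

slow_add(1, 2)
average = sum(slow_add.times) / len(slow_add.times)
```

In the real function I would wrap the solve step only when testing, leaving the production path untouched; I don't know whether this is considered better style than the boolean-flag approach, or whether a class holding the timings would be preferred.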

