
I have been using Gaussian Mixture Models (GMM) to model a set of peaks in a 2D numpy array (a).

import numpy as np
from sklearn import mixture

a = np.array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 100., 1000., 100., 2., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 100., 100., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
              [0., 0., 2., 1., 2., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0.],
              [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

The problem is that, in order to fit a GMM to my data with sklearn, I first have to generate a density_array, which holds an enormous number of data points depending on the height of the peaks in a.

def convert_to_density_array(array):
    """
    Convert a 2D array of counts into an (N, 2) array of (i, j)
    coordinates, repeating each coordinate once per unit of count.
    """
    density_list = []
    # iterate over each (i, j) coordinate in the array
    for (i, j), value in np.ndenumerate(array):
        for _ in range(int(value)):
            density_list.append((i, j))
    return np.array(density_list)

density_array = convert_to_density_array(a)
gmm = mixture.GaussianMixture(n_components=2, covariance_type='full').fit(density_array)

Is there an efficient way of representing a 2D numpy array for the purpose of fitting a GMM to it?
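As a side note, the conversion itself can be done in one vectorized call (it produces the same array and skips the Python-level loop, though it does not reduce the total number of points). A small sketch checking it against the loop version, using a tiny made-up array:

```python
import numpy as np

a = np.array([[0., 2., 0.],
              [1., 0., 3.]])

def convert_to_density_array(array):
    density_list = []
    for (i, j), value in np.ndenumerate(array):
        for _ in range(int(value)):
            density_list.append((i, j))
    return np.array(density_list)

# np.argwhere lists the nonzero coordinates in row-major order;
# np.repeat duplicates each one by its (integer) count
dense = np.repeat(np.argwhere(a), a[a != 0].astype(int), axis=0)

assert np.array_equal(dense, convert_to_density_array(a))
```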

pietro_molina
  • It is not a more effective method, but it gives the same result as yours and is simpler: `np.repeat(np.argwhere(a), a[a != 0].astype(int), axis=0)` – Mechanic Pig Sep 14 '22 at 13:02
  • Can you elaborate on what the `density_array` is and why it is required? – André Sep 14 '22 at 14:23

1 Answer


You can store the data with less precision by passing dtype=np.float32 to your np.array call. float32 keeps roughly 7 significant digits instead of float64's 15-16, which is totally acceptable in your case, and halves the memory footprint. That is, however, the only way to keep the same data in memory with a smaller footprint and still pass it to the GMM.
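A minimal sketch of the saving (the point count here is made up for illustration):

```python
import numpy as np

# stand-in for a density_array of one million (i, j) points
coords = np.ones((1_000_000, 2))

d64 = np.array(coords, dtype=np.float64)  # 8 bytes per value
d32 = np.array(coords, dtype=np.float32)  # 4 bytes per value

assert d32.nbytes == d64.nbytes // 2  # float32 halves the footprint
```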

What you are actually trying to do is curve fitting, not density modelling, so you can use scipy's curve_fit on your original data without building density_array at all. You pass it a function that is the sum of two Gaussians and, in a loop, randomly perturb the initial estimate until you get the smallest error. Since writing that code takes some time, consider this approach only if you cannot get your data into memory by any other method.
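A rough sketch of that approach, assuming two isotropic Gaussian peaks; the synthetic grid, peak parameters, and initial guess below are illustrative, not taken from the question:

```python
import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(coords, a1, x1, y1, s1, a2, x2, y2, s2):
    # sum of two isotropic 2D Gaussians, evaluated on flattened coords
    x, y = coords
    g1 = a1 * np.exp(-((x - x1) ** 2 + (y - y1) ** 2) / (2 * s1 ** 2))
    g2 = a2 * np.exp(-((x - x2) ** 2 + (y - y2) ** 2) / (2 * s2 ** 2))
    return g1 + g2

# synthetic 5 x 37 grid standing in for `a`, with peaks at (24, 1) and (5, 2)
yy, xx = np.mgrid[0:5, 0:37]
z = two_gaussians((xx, yy), 1000, 24, 1, 2.0, 100, 5, 2, 2.5)

# fit the raw grid values directly -- no density_array needed
p0 = [800, 23, 0.5, 1.5, 80, 6, 2.5, 2]  # rough initial estimate
popt, _ = curve_fit(two_gaussians, (xx.ravel(), yy.ravel()),
                    z.ravel(), p0=p0, maxfev=5000)
```

In the answer's scheme, `p0` would be re-drawn randomly in a loop, keeping the fit with the lowest residual error.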

Ahmed AEK
  • Thank you for the answer, it really helps. I am uncertain how scipy's curve_fit would solve my problem, though. Can you point me to more resources or elaborate on your answer? – pietro_molina Sep 15 '22 at 15:22