You might have figured it out already, but I thought I might as well post a possible solution to this.
The following code creates random data of shape (2, 100) and tries to train a 128-component GMM using the EM_uniform algorithm:
import sidekit
import numpy as np
import random as rn
gmm = sidekit.Mixture()
data = np.array([[rn.random() for i in range(100)],[rn.random() for i in range(100)]])
gmm.EM_uniform(data,
               distrib_nb=128,
               iteration_min=3,
               iteration_max=10,
               llk_gain=0.01,
               do_init=True)
However, this results in the same error as you have reported:
ValueError: operands could not be broadcast together with shapes (128,100) (128,0)
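The shapes in the message give a hint: one operand has 100 columns and the other has none, and NumPy cannot broadcast an empty axis against an axis of length 100. Here is a minimal sketch that reproduces the same message with plain NumPy. The array names are made up for illustration and are not Sidekit internals; only the shapes are taken from the traceback.

```python
import numpy as np

# Hypothetical stand-ins for the two operands; only the shapes match the traceback.
full = np.ones((128, 100))
empty = np.ones((128, 0))  # an array whose second axis ended up empty

try:
    full * empty
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

So whatever produces the second operand inside the library apparently ends up with an empty second axis.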
I suspect there is a bug in how gmm.invcov is calculated in Sidekit.Mixture._init_uniform(), so I initialized the mixture manually with code taken from Sidekit.Mixture._init() (the initialization function for the EM_split() algorithm).
The following code ran without errors on my computer:
import sidekit
import numpy as np
import random as rn
import copy
gmm = sidekit.Mixture()
data = np.array([[rn.random() for i in range(100)],[rn.random() for i in range(100)]])
# Initialize the Mixture with code from Sidekit.Mixture._init()
mu = data.mean(0)
cov = (data**2).mean(0)
gmm.mu = mu[None]
gmm.invcov = 1./cov[None]
gmm.w = np.asarray([1.0])
gmm.cst = np.zeros(gmm.w.shape)
gmm.det = np.zeros(gmm.w.shape)
gmm.cov_var_ctl = 1.0 / copy.deepcopy(gmm.invcov)
gmm._compute_all()
# Now run EM without initialization
gmm.EM_uniform(data,
               distrib_nb=128,
               iteration_min=3,
               iteration_max=10,
               llk_gain=0.01,
               do_init=False)
This gave the following output:
[-31.419146414931213, 54.759037708692404, 54.759037708692404, 54.759037708692404]
These are the log-likelihood values after each iteration (it converged after 4 iterations). Do note that this example data is way too small to train a GMM on.
I cannot guarantee that this won't lead to errors later on; leave a comment if it does!
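For reference, the manual initialization above builds a single component from column-wise statistics. Note that `(data**2).mean(0)` is the mean of squares, not the centered variance; I kept it because that is what the copied code does. Below is a small NumPy-only sketch of the quantities and shapes involved. The seeded generator and the name `second_moment` are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((2, 100))  # same shape as the example data

mu = data.mean(0)                    # per-column mean, shape (100,)
second_moment = (data ** 2).mean(0)  # per-column mean of squares, shape (100,)

# The mean of squares equals the (population) variance plus the squared mean.
assert np.allclose(second_moment, data.var(0) + mu ** 2)

# Indexing with None adds a leading axis: one row per component,
# and there is a single component at this point.
mu_init = mu[None]
invcov_init = 1.0 / second_moment[None]
print(mu_init.shape, invcov_init.shape)
```

Both arrays come out with shape (1, 100), i.e. one Gaussian whose parameters cover all 100 columns.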
As for HDF5 files, check out the h5py documentation for tutorials. Also, HDFView lets you look into the contents of the .h5 files, which is pretty convenient for debugging later on when you get to scoring.
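If you prefer to inspect a file from Python instead of HDFView, something along these lines should work with h5py. This is only a sketch: the file name and the dataset path `features/session1` are hypothetical and do not reflect how Sidekit actually lays out its files.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write a tiny file so the example is self-contained.
with h5py.File(path, "w") as f:
    f.create_dataset("features/session1", data=np.zeros((10, 2)))

found = []

def record(name, obj):
    # visititems calls this for every group and dataset in the file;
    # only datasets are recorded here.
    if isinstance(obj, h5py.Dataset):
        found.append((name, obj.shape, str(obj.dtype)))

with h5py.File(path, "r") as f:
    f.visititems(record)

print(found)
```

Pointing `path` at one of your own files (and dropping the write step) lists every dataset with its shape and dtype, which is usually enough to spot an empty or misshapen array.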