
I'm using sklearn.mixture.GMM to fit some data and am having trouble sampling from the GMM for one item in the dataset.

In over 1000 instances of the data it works fine, but in the case below (data_not_working) I get an error when running the following code:

from sklearn import mixture
import numpy as np 

data_not_working = np.array([[-13.3669, -0.152287, -0.926697, 0.0967975, 0.375109, 0.22213, 0.364592, 0.283643, 0.614218, -0.117485, 0.221134, 0.104302], [-7.32323, -0.515594, -0.864193, 0.102628, 0.32041, 0.0606005, 0.197593, 0.025868, 0.249107, -0.0754152, 0.0994283, 0.0511292], [-5.70166, -0.408034, -1.22175, 0.220845, 0.2968, 0.0308518, 0.013137, -0.672265, -0.180614, -0.231932, -0.141483, 0.318216], [-3.84773, -0.13171, -1.37403, 0.242801, 0.399666, -0.150793, -0.342479, -0.689551, -0.246872, 0.00635363, 0.148948, 0.221603], [-3.12773, 0.172297, -1.38291, 0.00240961, 0.475504, 0.18957, -0.593592, -0.378285, -0.195662, -0.10973, 0.369654, 0.143974], [-2.43561, 0.0644245, -0.95012, 0.289466, 0.292279, -0.0631116, -0.546317, -0.138747, -0.104671, -0.0917557, 0.101156, -0.0469524], [-2.76789, -0.0416676, -1.18993, 0.392875, 0.136845, -0.263689, -0.402386, 0.206513, 0.335653, 0.0999453, 0.0125673, 0.226993], [-2.57943, -0.102039, -1.46225, 0.550504, 0.103789, 0.0240493, -0.116903, 0.25877, 0.189019, -0.107692, -0.134221, 0.333413], [-2.44367, 0.119016, -0.61038, 0.896835, 0.0487419, 0.281915, -0.0475086, -0.145234, 0.126528, -0.109666, 0.0714544, 0.102345], [-2.73143, 0.317259, -0.546473, 0.842293, -0.228764, 0.0580869, -0.128803, -0.523804, 0.0935071, -0.0131786, -0.0838011, -0.299564], [-2.86395, 0.282303, -1.00826, 0.65241, -0.317471, -0.0948204, 0.186242, -0.214155, 0.0747489, -0.163622, -0.00290485, -0.0116438], [-2.96273, 0.210327, -0.76213, 0.743427, -0.435498, -0.249532, 0.249474, -0.160216, -0.12336, -0.240312, -0.270668, -0.133469], [-3.35801, 0.362276, -0.507548, 0.301616, -0.583986, -0.424966, 0.0257714, -0.11669, 0.201161, 0.0104573, -0.267932, 0.164152], [-3.52099, 0.489393, -0.45938, 0.0439511, -0.250481, -0.490404, -0.0479253, 0.13449, -0.229827, -0.116102, -0.0683664, -0.0311946], [-3.01492, -0.0464895, -0.166774, -0.147464, -0.258049, -0.401865, 0.0168582, 0.277897, -0.0941365, -0.375444, -0.0174562, 0.0673491], [-3.30715, 0.26851, -0.803025, 
-0.0088587, -0.258561, -0.369787, 0.0882617, 0.223542, 0.0424378, -0.179769, 0.138257, 0.0615963], [-4.87222, 0.403703, -1.07541, 0.0120966, 0.00684427, -0.111497, 0.164573, 0.410325, -0.364741, 0.0662429, 0.0136844, 0.384867], [-5.87392, -0.310827, -1.04405, 0.176996, -0.131957, 0.2619, 0.0554216, 0.140458, -0.17792, 0.0856086, -0.375274, -0.0801583], [-7.16114, 0.866077, -1.83373, 0.625741, 0.0481332, 0.0240574, -0.135544, 0.294257, 0.0575935, -0.146078, -0.355156, 0.198461]])

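def gmmSample(data):
    # Fit a 3-component GMM with full covariance matrices.
    gmm = mixture.GMM(n_components=3, covariance_type='full', n_iter=100)
    gmm.fit(np.array(data))
    # Draw samples from the fitted model; this is the call that fails.
    gmm.sample(100000)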

gmmSample(data_not_working)

This produces the following runtime error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/sklearn/mixture/gmm.py", line 411, in sample
    num_comp_in_X, random_state=random_state).T
  File "/Library/Python/2.7/site-packages/sklearn/mixture/gmm.py", line 102, in sample_gaussian
    s, U = linalg.eigh(covar)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/linalg/decomp.py", line 387, in eigh
    raise LinAlgError("unrecoverable internal error.")
numpy.linalg.linalg.LinAlgError: unrecoverable internal error.

So the problem is in sampling from the GMM, not fitting it. Here is an example of a data instance where the above code works fine (as it does in all the other 1k+ instances I am working with). All the instances have the same shape:

data_working = np.array([[-13.8942, 0.329383, -0.467724, -0.0533347, 0.135847, 0.063669, 0.205088, 0.0188045, 0.200259, -0.153357, 0.0282053, 0.19137], [-10.0263, -0.232325, -1.23603, -0.373344, -0.270465, 0.223835, 0.245468, -0.14771, -0.21643, 0.0690714, 0.00436133, -0.0100653], [-7.2949, -1.02805, -0.360764, -0.211618, -0.0396331, 0.138607, 0.0274424, -0.0949814, -0.0290368, -0.195617, -0.064841, -0.0334741], [-5.27361, -1.45856, -0.0538218, 0.325073, -0.0113113, -0.182038, 0.0113554, 0.0380641, -0.155189, -0.000775465, -0.0834289, -0.00448654], [-3.4687, -1.80423, 0.181359, 0.216309, -0.0175896, -0.14976, 0.011689, -0.123908, -0.234207, 0.0114323, -0.157273, 0.153515], [-5.46375, -1.50817, -0.26668, 0.114913, 0.041553, 0.232375, 0.193539, -0.022985, -0.123261, 0.0131678, -0.225528, 0.0131385], [-8.96966, -0.926118, -1.14693, -0.0732326, -0.069377, 0.202194, -0.0373959, 0.155714, -0.0575818, 0.153754, 0.0827817, -0.0899819], [-5.4489, -1.46598, -0.904309, -0.180178, -0.0387, 0.284963, -0.0209437, 0.161178, -0.334906, 0.0925891, 0.0626761, -0.20815], [-6.67765, -0.909459, -0.893041, -0.528669, -0.287356, -0.317459, 0.0218326, 0.212814, -0.0544577, 0.0569478, -0.21171, -0.166358], [-5.83495, -1.40242, -1.08698, -0.295603, -0.44182, 0.0875251, -0.307424, 0.0605037, 0.142951, 0.0753836, -0.0953188, 0.00819761], [-5.92017, -1.05822, -0.898107, -0.0233588, -0.318233, -0.266055, -0.458731, 0.132217, -0.107108, -0.154634, -0.00669574, 0.142476], [-6.2026, -1.71479, -0.465533, -0.26163, 0.303861, -0.00872642, 0.155504, 0.614625, -0.207519, -0.212606, -0.0592188, 0.0887861], [-10.7305, -1.13431, -0.979158, 0.219761, -0.342731, -0.175846, 0.0111934, 0.226708, -0.0161784, -0.248745, 0.0470983, -0.0252792], [-8.0586, -1.45944, -1.18256, 0.0650664, 0.259971, -0.285369, -0.202342, 0.0675689, -0.238931, -0.0665339, 0.0854533, 0.0714763], [-5.61462, -1.77467, -1.17853, -0.402395, 0.0316058, -0.358417, -0.212316, 0.215444, 0.0111266, -0.17753, 0.106201, 0.102555], [-7.32914, 
-1.46897, -1.03672, 0.209392, -0.032743, -0.0519038, -0.30758, -0.377465, -0.329729, 0.0569532, -0.0359641, 0.182907], [-6.88854, -1.81873, -0.421743, -0.312312, -0.218102, 0.10227, -0.200002, -0.161226, -0.319451, 0.21934, -0.203555, -0.0566904], [-5.54895, -1.97478, -0.552426, -0.232346, -0.192567, -0.213922, -0.118116, 0.0830695, 0.0688067, 0.163558, 0.0393377, 0.269313], [-6.06666, -1.81661, -0.410524, -0.135279, -0.0956775, -0.269271, -0.164703, -0.0854252, -0.113826, 0.003071, 0.0617395, 0.247204]])

Interestingly, if I reduce the number of samples drawn from the GMM to 10, it works sometimes, but not every time!
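One possible explanation for the dependence on sample count, offered as a guess rather than a confirmed reading of the library: the `num_comp_in_X` in the traceback suggests the sampler first assigns each draw to a component and only decomposes the covariance of components that received at least one draw. If the problematic component has a small mixing weight, 10 draws would often skip it, while 100000 draws almost never would. The sketch below uses NumPy only; the weights and the helper names are hypothetical, not taken from the fitted model or from sklearn.

```python
import numpy as np

def prob_component_skipped(weight, n_samples):
    """Probability that a component with this mixing weight gets no draws."""
    return (1.0 - weight) ** n_samples

def simulate_skips(weights, comp, n_samples, n_trials=2000, seed=0):
    """Fraction of trials in which component `comp` receives zero draws."""
    rng = np.random.RandomState(seed)
    counts = rng.multinomial(n_samples, weights, size=n_trials)
    return np.mean(counts[:, comp] == 0)

# Hypothetical weights (an assumption): the third component is nearly empty.
weights = [0.60, 0.39, 0.01]
p10 = prob_component_skipped(weights[2], 10)
p_big = prob_component_skipped(weights[2], 100000)
sim10 = simulate_skips(weights, 2, 10)
print("P(skipped), 10 draws:     %.3f (simulated %.3f)" % (p10, sim10))
print("P(skipped), 100000 draws: %.3g" % p_big)
```

If this guess is right, printing the fitted weights for the failing instance should show at least one component with a weight close to zero.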

Looking at the data more closely, data_not_working may really contain only 2 or fewer components: the code runs without error when n_components is reduced to 2, so forcing a 3-component model might be what triggers the failure. However, I still don't understand what causes the error, or whether it is a bug in the library.
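A quick way to test the "too many components" idea is to inspect the fitted covariance matrices before sampling. 19 points in 12 dimensions leave very little data per component: a covariance estimated from n points has rank at most n - 1, so a component that effectively owns fewer than 13 points has a (near-)singular covariance unless regularization rescues it. The helper below is a NumPy-only sketch; `check_covariances` is a name made up here, and it assumes the fitted matrices can be passed in as an array of shape (n_components, n_features, n_features), which I believe is what `gmm.covars_` holds for covariance_type='full'.

```python
import numpy as np

def check_covariances(covars, sym_tol=1e-8):
    """Summarize each covariance matrix: finite? symmetric? smallest eigenvalue?"""
    reports = []
    for k, cov in enumerate(np.asarray(covars, dtype=float)):
        finite = bool(np.all(np.isfinite(cov)))
        symmetric = finite and bool(np.allclose(cov, cov.T, atol=sym_tol))
        # eigvalsh assumes a symmetric matrix; skip it when entries are NaN/inf.
        min_eig = float(np.linalg.eigvalsh(cov).min()) if finite else float("nan")
        reports.append({"component": k, "finite": finite,
                        "symmetric": symmetric, "min_eig": min_eig})
    return reports

# Synthetic stand-ins (not the fitted model): 6 points in 12 dimensions give a
# rank-deficient covariance; the second matrix contains a NaN.
rng = np.random.RandomState(0)
few_points = rng.randn(6, 12)
singular_cov = np.cov(few_points, rowvar=False)
nan_cov = np.eye(12)
nan_cov[0, 0] = np.nan

for report in check_covariances([singular_cov, nan_cov]):
    print(report)
```

On the real model, the check would run right after `gmm.fit(...)`. A NaN entry or a non-positive smallest eigenvalue would point to the fit, rather than the sampler, as the source of the error.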

I have now also run the same code on different systems. It works on some but not others, and this does not seem to depend on the Python or library versions: two machines running the same disc image, with identical Python, SciPy, NumPy and sklearn versions, behave differently (one works, the other doesn't). Very strange.

Am I missing something obvious or is there an issue with the library? Thanks

Amir
tribalsoul
  • works for me (ran ~1000 times in a loop) – lejlot Jan 08 '16 at 00:03
  • I suggest printing the fitted model (means and covariances) to see what happened. I guess some numerical instability could create invalid covariance matrices (however, it works just fine for me) – lejlot Jan 08 '16 at 00:06

0 Answers