1

I am trying to use Expectation Maximization (EM) to classify my data. But due to my limited experience with Python and algorithms, when I writing the E-step of the algorithm, the method multivariate_normal.pdf() returns zeros and I have no idea why does this happen and how do I fix it.

My parameters set:

n_clusters = 2
n_points = len(X)
m1 = X[:,0].mean() 
m2 = X[:,1].mean() 
Mu = [[m1, m2*1.1], [m1, m2*0.9]]
Var = [[1, 1], [1, 1]]
Pi = [1 / n_clusters] * 2
W = np.ones((n_points, n_clusters)) / n_clusters 
Pi = W.sum(axis=0) / W.sum()

E-step: enter image description here

Function for E-step:

def update_W(X, Mu, Var, Pi):
    n_points, n_clusters = len(X), len(Pi)
    pdfs = np.zeros(((n_points, n_clusters)))
    for i in range(n_clusters):
        pdfs[:, i] = Pi[i] * multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))
    W = pdfs / pdfs.sum(axis=1).reshape(-1, 1)
    return W

The problem is that the code

multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))

returns zeros and it does not make sense to the algorithm.

My data is here:

X = np.array([[5.04729, 37744.57848238945],
 [5.0008, 35899.206357479095],
 [4.98011, 33056.83955574036],
 [4.9931, 33704.46286725998],
 [4.96448, 32990.76251792908],
 [4.99008, 32283.805765628815],
 [5.04159, 34409.80894064903],
 [5.00075, 35270.9743578434],
 [5.00899, 36284.97508478165],
 [4.97215, 35458.088497161865],
 [4.9763, 35349.08898703754],
 [5.02148, 38928.51481676102],
 [4.96585, 33278.32470035553],
 [5.02545, 41023.1229429245],
 [5.00402, 39417.85598278046],
 [5.02445, 38158.13270044327],
 [5.02238, 36516.55680179596],
 [4.98606, 34646.19884157181],
 [5.0033, 50677.906705379486],
 [5.02733, 49949.25080680847],
 [5.02716, 39155.43921852112],
 [5.04183, 39191.40325546265],
 [4.96896, 38266.83190345764],
 [4.99652, 36774.488584041595],
 [5.03478, 33207.51669681072],
 [5.03832, 34724.067808151245],
 [5.02347, 32981.52464199066],
 [4.96275, 33158.869076013565],
 [4.97127, 33810.3498980999],
 [5.01482, 35844.073690891266],
 [4.97365, 36869.271273851395],
 [5.00685, 34672.39159619808],
 [4.98033, 34688.13522028923],
 [5.00265, 32825.98570096493],
 [5.00083, 34160.511184334755],
 [5.01027, 33884.14221894741],
 [4.95678, 38457.50146174431],
 [4.96867, 36320.28945875168],
 [5.00721, 33037.772767663],
 [5.04357, 33117.379173755646],
 [5.0004, 34557.091692984104],
 [5.02337, 34421.044409394264],
 [5.01924, 36974.69707453251],
 [5.04621, 38812.16114796698],
 [4.96225, 33006.868394851685],
 [5.03166, 33958.025827884674],
 [4.95124, 34349.11486387253],
 [5.02567, 34259.3874104023],
 [5.04868, 41895.62332487106],
 [5.00072, 34009.417558074],
 [5.04818, 35927.81018912792],
 [4.97368, 33997.19724833965],
 [4.97671, 33999.519283890724],
 [4.95128, 31926.064946889877],
 [4.95742, 33177.98467195034],
 [4.9803, 36231.32753157616],
 [4.99018, 35190.21499049879],
 [5.00053, 34551.94658563426],
 [4.96793, 33994.06255364418],
 [4.98345, 32120.49098587036],
 [5.04564, 30271.260227680206],
 [4.95421, 30653.037763834],
 [5.03442, 31966.675320148468],
 [4.9722, 33757.07542133331],
 [5.01377, 33405.48862147331],
 [5.03417, 33558.4126021862],
 [5.02044, 30911.697856664658],
 [4.98318, 33883.324236392975],
 [4.99876, 36436.7568192482],
 [4.98475, 34489.98227977753],
 [5.00355, 30205.769625544548],
 [5.03646, 33422.97079265118],
 [5.04884, 34117.2653195858],
 [5.02259, 34993.6039069891],
 [4.9737, 47981.5806927681],
 [5.00593, 56685.62618494034],
 [5.00434, 54395.169895286555],
 [4.95768, 35611.35310435295],
 [5.02978, 37536.10489606857],
 [4.95214, 35577.900700092316],
 [4.99925, 33014.68128633499],
 [5.03029, 29865.81333065033],
 [4.95128, 28760.52792918682],
 [5.02051, 37038.269605994225],
 [5.04974, 36642.584582567215],
 [4.96896, 29041.81211209297],
 [4.97972, 37114.91027057171],
 [4.99165, 36424.69217658043],
 [4.95779, 33859.12481832504],
 [5.01495, 32957.517973423004],
 [4.95671, 32964.28135430813],
 [5.00732, 34051.169529914856],
 [4.98396, 32881.329072237015],
 [5.02704, 34020.69495886564],
 [4.98323, 34851.629052996635],
 [5.02463, 34762.618599653244],
 [4.95826, 31249.48672771454],
 [5.02317, 35223.82992994785],
 [4.98121, 31181.02133178711],
 [5.00584, 28375.87954646349],
 [4.98506, 31617.50852251053],
 [5.02145, 31114.64755296707],
 [5.03362, 31182.910351395607],
 [5.03503, 30093.33342540264],
 [4.95797, 36398.88142120838],
 [4.97227, 35468.531022787094],
 [5.00365, 35569.601973861456],
 [4.96697, 33233.918072104454],
 [5.01004, 31831.104349255562],
 [5.01838, 36176.26664984226],
 [5.01747, 35596.35544002056],
 [5.0358, 32386.7973613739],
 [4.973, 33052.33617353439],
 [5.03017, 32395.227126598358],
 [4.98371, 34437.29500854015],
 [5.02143, 31013.464814782143],
 [5.02339, 36368.26537466049],
 [5.02942, 36371.447618722916],
 [5.04456, 38037.0499355793],
 [5.02219, 40177.06857061386],
 [4.95168, 37964.482201099396],
 [4.99589, 38579.15203022957],
 [4.99336, 34731.92454767227],
 [4.98333, 34161.41907787323],
 [4.95841, 30232.777291665203],
 [4.95133, 30595.70202767849],
 [5.0252, 30708.32647585869],
 [5.04505, 35107.191153526306],
 [4.98019, 35513.3441824913],
 [4.97158, 30879.74526977539],
 [4.96096, 33175.533081412315]])

Thanks in advance!

Rick
  • 61
  • 7
  • 1
    please define/link `EM` for readers who may be unfamiliar with the term – anon01 May 26 '21 at 07:14
  • please include the code you use to call your `update_W` function – rudolfovic May 26 '21 at 07:55
  • @rudolfovic Thanks for your advice. I did not call the function before I post the question, I was just testing the code in function piece by piece. I just tried and calling `update_W(X, Mu, Var, Pi)` returns nans because `pdfs` is zeros. – Rick May 26 '21 at 08:09
  • where are you importing your `multivariate_normal` from? – rudolfovic May 26 '21 at 08:13
  • `from scipy.stats import multivariate_normal` – Rick May 26 '21 at 08:17

1 Answers1

1

Let's have a look at some actual numbers:

>>> Mu[0]
[5.000155496183207, 38591.28312042943]
>>> Var[0]
[1, 1]
>>> X[:5]
array([[5.04729, 37744.5785],
       [5.00080, 35899.2064],
       [4.98011, 33056.8396],
       [4.99310, 33704.4629],
       [4.96448, 32990.7625]])

Now, let's try the following value and see what we get:

>>> multivariate_normal.pdf([5, 38561.5785], Mu[0], np.diag(Var[0]))
3.970168119488602e-193

You get a really really low probability value but at least it's not zero.

Let's ignore the first dimension for a second as it isn't the problematic one: 38561 is about 30 standard deviations away from 38591.

For a univariate normal distribution we know that the probability of being more than 3 standard deviations away from the mean is just 0.3% and it decreases very very fast:

enter image description here

Look at the values stored in X - they are much much further away from Mu[0] than 30 standard deviations. These values are so unlikely to come from this distribution that as far as Python is concerned, the probability of that happening is 0.0.

A better idea would be to start with the actual covariance of your data:

Var = np.cov(X.T) + 1
Var = [Var, Var]

Then suddenly you get decent probabilities :)

P.S. I am adding 1 to the covariance matrix to avoid singularity.

rudolfovic
  • 3,163
  • 2
  • 14
  • 38
  • Thank you very much for your answer! It really inspired me, I ignored the data scale. What do you think if I use min-max normalization on my data? Thanks again. – Rick May 26 '21 at 09:28
  • You're welcome. MMN won't change the distribution of your data so if you have a lot of outliers, points very distant from your mean, they will still be relatively far from the mean after you scale. That's pretty much exactly why standardisation is applied instead in these cases - to do which you need to compute your sample (co)variance. – rudolfovic May 26 '21 at 09:36