I am using a multivariate Gaussian distribution to detect abnormalities. This is what the training set looks like:

19-04-16    05:30:31    1   0   0   377816  305172  5567044 0   0   0   14  62  75  0   0   100 0   0
<Date>      <time>     <--------------------------- -------   Features --------------------------->

Let's say one of the above features does not change: it stays at zero.

Calculating the mean as mu:

mu = mean(X)'

Calculating sigma2 as

sigma2 = ((1/m) * (sum((X - mu') .^ 2)))'

The probability of each individual feature in each data set is calculated using the standard Gaussian formula:

p(x; mu, sigma2) = (1 / sqrt(2 * pi * sigma2)) * exp(-((x - mu) ^ 2) / (2 * sigma2))

For a particular feature, if all values are zero, then the mean (mu) is also zero, and consequently sigma2 is zero as well. So when I calculate the probability through the Gaussian distribution, I get a "division by zero" problem.
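The failure described above can be reproduced with a small sketch. The toy matrix `X` and the test point `x` are made up for illustration; only the `mu` and `sigma2` expressions come from the question, and the subtraction `X - mu'` assumes an environment with broadcasting (Octave, or a recent MATLAB).

```octave
% Toy training set (hypothetical): 4 examples, 2 features.
% The second feature is constant (always zero).
X = [1 0; 2 0; 3 0; 4 0];
m = size(X, 1);

mu = mean(X)';                               % per-feature means, column vector
sigma2 = ((1/m) * (sum((X - mu') .^ 2)))';   % per-feature variances, column vector

% Per-feature Gaussian density for one example x (column vector).
x = [2; 0];
p = (1 ./ sqrt(2 * pi * sigma2)) .* exp(-((x - mu) .^ 2) ./ (2 * sigma2));

disp(sigma2');   % the variance of the constant feature is exactly 0
disp(p');        % its density is NaN: 0/0 inside exp(), times 1/sqrt(0)
```

Octave does not raise an error here; the zero variance shows up as a NaN in `p`, which then poisons any product taken over the features.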

However, in test sets this feature's value can fluctuate, and I would like to treat that as an abnormality. How should this be handled? I don't want to ignore such a feature.

Anugraha Sinha
  • If a feature is truly constant across all the instances, then it's useless for classification and it can be removed – Dima Svider Dec 21 '21 at 03:49

1 Answer

The problem occurs whenever you have a variable that is constant. Approximating such a variable with a normal distribution makes no sense: all the information about it is contained in a single value, and that is the intuition behind why this division by zero occurs.

If you know that the variable can fluctuate in ways not observed in the training set, you can simply keep its variance from dropping below a certain value: use max(variance(X), eps) instead of the plain variance. Then you can be sure that no division by zero occurs.
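A minimal sketch of that variance floor, reusing the question's `mu` and `sigma2` expressions. The toy data, the name `min_var`, and its value `1e-6` are assumptions for illustration; the answer only says "a certain value", so the floor has to be chosen for the data at hand.

```octave
% Toy training set (hypothetical): the second feature is constant at zero.
X = [1 0; 2 0; 3 0; 4 0];
m = size(X, 1);

mu = mean(X)';
sigma2 = ((1/m) * (sum((X - mu') .^ 2)))';

% Variance floor: min_var is an assumed, tunable value, not from the answer.
min_var = 1e-6;
sigma2 = max(sigma2, min_var);

% Per-feature Gaussian density for one example x (column vector).
gauss = @(x) (1 ./ sqrt(2 * pi * sigma2)) .* exp(-((x - mu) .^ 2) ./ (2 * sigma2));

p_seen = gauss([2; 0]);      % constant feature at its training value
p_unseen = gauss([2; 1]);    % constant feature fluctuates in the test set

disp(prod(p_seen));          % finite joint density
disp(prod(p_unseen));        % far smaller, so the example can be flagged
```

With a small floor, the density of the constant feature is sharply peaked at its training value, so any deviation drives the product over the features toward zero. A larger `min_var` makes the detector more tolerant of small fluctuations.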

Marcin Możejko
  • Thanks for the inputs. I will try and update about the results. – Anugraha Sinha Jul 05 '16 at 04:50
  • And? Did my answer help you? – Marcin Możejko Jul 06 '16 at 13:03
  • Once again, thanks for the suggestion, and sorry for the late reply. Yes, it does work. What I understand from your suggestion (please confirm) is that we incorporate a small "variance" for that feature (which I would probably add only if the mean/std comes out to be zero), so that even the smallest deviation from this value can be termed an abnormality. EPS would be, by definition, the spacing between two adjacent numbers in the machine's floating-point system. I think this should do the trick. :-) – Anugraha Sinha Jul 08 '16 at 13:20