I have a task that requires using machine learning for anomaly detection. My data is sales-count info that looks like this:
{ 5/*bread*/, 10/*milk*/, 2/*potato*/, .../*other products*/ },
{ 6, 9, 3, ... }, { 5, 12, 1, ... },
{ 10/*bread sales count is anomalously high*/, 10, 2, ... },
{ 4, 8, 3, ... }
I coded a learning-set generator based on this idea: take a given product's sales counts as an array { 5, 6, 5, 10, 4, ... }, compute the mean (let's assume it's 5), and convert the array into percent differences from the mean: { 0%, 20%, 0%, 100%, -20%, ... }. If the array contains no value whose absolute value exceeds 5%, the row has no anomalies. I know I could check this with the simplest hand-written function, but my task REQUIRES machine learning.
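To make that step concrete, here's roughly what my percent-diff transform does (a simplified sketch; `to_percent_diffs` is just an illustrative name, not my actual code):

```python
def to_percent_diffs(counts):
    """Convert one product's sales counts into percent differences from their mean."""
    mean = sum(counts) / len(counts)
    return [(c - mean) / mean * 100.0 for c in counts]

# For counts whose mean is exactly 5, the diffs come out as in my example:
print(to_percent_diffs([5, 6, 5, 4, 5]))  # values near [0.0, 20.0, 0.0, -20.0, 0.0]
```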
My generator produced sequences like { 1%, 3%, -2%, 5%, 1%, ... } and labeled them as good. It made around 1k good sequences that way.
After that, the generator produced anomalous sequences by modifying good ones, like this: { 24%, 3%, -2%, 5%, 1%, ... }, { -24%, 3%, -2%, 5%, 1%, ... }, { 1%, 24%, -2%, 5%, 1%, ... }, { 1%, -24%, -2%, 5%, 1%, ... }, ..., { 1%, 3%, -2%, 5%, -100% }
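In other words, my generator logic is roughly this (a sketch with names I made up for this post; the spike value of ±24% and the ±5% "good" limit are the ones from my examples):

```python
import random

def make_good_sequence(length=5, limit=5.0):
    # All percent diffs within ±limit → labeled "good".
    return [round(random.uniform(-limit, limit), 1) for _ in range(length)]

def make_anomalous_variants(good, spike=24.0):
    # Replace each position in turn with +spike and -spike,
    # producing 2 * len(good) anomalous copies of one good sequence.
    variants = []
    for i in range(len(good)):
        for s in (spike, -spike):
            bad = list(good)
            bad[i] = s
            variants.append(bad)
    return variants

print(make_anomalous_variants([1.0, 3.0, -2.0, 5.0, 1.0])[:2])
```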
Later I scaled these percentages into the [0, 1] range and fed them to a multilayer perceptron with 128 neurons in the second layer, 32 in the third, and 2 in the output layer (good or anomaly). After training I got around a 50% recognition rate, which is terribly bad — no better than chance on two classes.
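For reference, my setup is equivalent to something like the following (a minimal sketch, not my actual code: I'm using scikit-learn's MLPClassifier only as a stand-in for my network, synthetic data standing in for my generated set, and assuming percents in [-100, 100] are scaled via (p + 100) / 200):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 200
# "Good" sequences: all percent diffs within ±5%.
good = rng.uniform(-5, 5, size=(n, 5))
# "Bad" sequences: one position replaced by a ±24% spike.
bad = good.copy()
bad[np.arange(n), rng.integers(0, 5, size=n)] = rng.choice([-24.0, 24.0], size=n)
X = np.vstack([good, bad])
y = np.array([0] * n + [1] * n)  # 0 = good, 1 = anomaly

# Scale percents from [-100, 100] into [0, 1], as described above.
X01 = (X + 100.0) / 200.0

# Hidden layers matching my 128/32 architecture.
clf = MLPClassifier(hidden_layer_sizes=(128, 32), max_iter=1000, random_state=0)
clf.fit(X01, y)
print(clf.score(X01, y))
```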
Then I modified my generator to produce 1k good sequences like { 1%, 3%, -2%, 5%, 1%, ... } and 1k bad sequences like { 25%, 50%, -60%, 40%, -80%, ... }. The recognition rate was still around 50%.
How should I generate the learning set so that the network will later classify { 1%, 3%, -2%, 5%, 1%, ... } as good and any sequence like { 1%, -24%, -2%, 5%, 1%, ... } as bad?