5

i would like to find what the best way to detect outliers is. here is the problem and some things which probably will not work. let's say we want to fish out some quasi-uniform data from a dirty varchar(50) column in mysql. let's start by doing an analysis by string length.

| strlen |  freq  | 
|      0 |   2312 |
|      3 |     45 |
|      9 |     75 |
|     10 |  15420 |
|     11 |    395 |
|     12 |    114 |
|     19 |     27 |
|     20 |   1170 |
|     21 |     33 |
|     35 |     9  |

what i would like to do is devise an algorithm to determine which string length has a high probability of being purposefully unique rather than being typeo's or random garbage. this field has the possibility of being an "enum" type, so there can be several frequency spikes for valid values. clearly 10 and 20 are valid, 0 is just omitted data. 35 and 3 might be some random trash despite both being very different in frequency. 19 and 21 might be type-os around the 20 format. 11 might be type-os for 10, but what about 12?

it seems simply using occurrence frequency % is not enough. there need to hotspots of higher "just an error" probability around the obvious outliers.

also, having a fixed threshold fails when there are 15 unique lengths which can vary by between 5-20 chars, each with between 7% - 20% occurrence.

standard deviation will not work because it relies on the mean. median absolute deviation probably wont work because you can have a high frequency outlier that cannot be discarded.

yes there will be other params for cleaning the data in the code, but length seems to very quickly pre-filter and classify fields with any amount of structure.

are there any known methods which would work efficiently? i'm not very familiar with Bayesian filters or machine learning but maybe they can help?

thanks! leon

leeoniya
  • 1,071
  • 1
  • 9
  • 25

1 Answers1

2

Sounds like anomaly detection is the way the to go. Anomaly detection is a kind of machine learning that is used to find outliers. It comes in a couple of varieties, including supervised and unsupervised. In supervised learning, the algorithm is training using examples of outliers. In unsupervised learning, the algorithm attempts to find outliers without any examples. Here are a couple of links to start out:

http://en.wikipedia.org/wiki/Anomaly_detection

http://s3.amazonaws.com/mlclass-resources/docs/slides/Lecture15.pdf

I didn't find any links to readily available libraries. Something like MATLAB, or its free cousin, Octave, might be a nice way to if you can't find an anomaly detection library in your language of choice. https://goker.wordpress.com/tag/anomaly-detection/

ahoffer
  • 6,347
  • 4
  • 39
  • 68
  • thanks, this is very interesting and looks to be spot-on, but does not seem easy enough to implement to be worthwhile for my project :( i'm going to keep exploring some simpler weighted-average implementation that's decently effective. – leeoniya Jan 20 '12 at 06:36
  • @leeoniya. Local outlier factor looks like a great choice. The article you link to includes a plot of Dim2 versus Dim1, which in this case would be Frequency versus String Length. ...It looks reasonable simple to implement, but there isn't much to go on in the article. Can you point us to some sources other than the references to papers at the end of the article? – ahoffer Jan 20 '12 at 08:40
  • i'm still researching this, it looks like this might be a great solution: http://en.wikipedia.org/wiki/DBSCAN another one is SNN: http://get.dsi.uminho.pt/local/ – leeoniya Jan 20 '12 at 15:28
  • @leeoniya. DBSCAN would work too. You have to select k, the minimum number points in a cluster, but with a little experimentation, you should be able to find some good values of k. – ahoffer Jan 20 '12 at 16:48