1

I have a sensor whose output consists of a single attribute (one value per sample). An example batch of sequential data is as follows:

sample: 199 200 205 209 217 224 239 498 573 583 583 590 591 594 703 710 711 717 719 721 836 840 845 849 855 855 856 857 858 858 928 935 936 936 942 943 964 977

You can see the data in the first image (input).

input

The data is divided into levels. The number of levels is given to me (5 levels in this example). However, the number of samples in each level is unknown, and so are the distances between the levels.

I need to exclude the outliers and find the center of each level (see the second image, output).

output

The red samples represent the outliers and the yellow samples represent the centers of the levels. Is there any algorithm, mathematical formula, or C++ code that could help me achieve this?

I tried KMeans (with K = 5 in this example) and got bad results because of the random initial centroids. Most of the time, some initial centroids fall in the same level, which splits that level into two clusters, while two other levels end up merged into one cluster. If I set the initial centroids manually by selecting one centroid from each level, I get very good results.

peterh
asker
  • Not really a coding question. You could consider https://cstheory.stackexchange.com/ or https://dsp.stackexchange.com/ or https://scicomp.stackexchange.com/ or https://datascience.stackexchange.com/ (OK, maybe there are too many SE sites...) – JHBonarius Apr 10 '18 at 10:52

4 Answers

3

If the difference between two successive data points is greater than a particular value (call it Delta), then they belong to different clusters.

For this data set: 199 200 205 209 217 224 239 498 573 583 583 590 591 594 703 710 711 717 719 721 836 840 845 849 855 855 856 857 858 858 928 935 936 936 942 943 964 977

Assume Delta is 15 (fine-tune this for your sensor). If the difference between successive data points is not greater than 15, they belong to the same cluster. You can find the center of a cluster by taking its middle value. If a point has no neighboring point within Delta, it can be considered an outlier. Another option is to vary Delta based on the current values in the data set.
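The gap rule above can be sketched in a few lines of Python (used here only for illustration; it ports directly to C++). This is a minimal sketch, not a definitive implementation. The function names, the use of the median as the "mid value", and the `min_size` parameter (runs with fewer samples are treated as outliers) are assumptions added for this example.

```python
from statistics import median

def split_by_gap(values, delta):
    """Split sorted values into runs; a gap greater than delta starts a new run."""
    values = sorted(values)
    runs = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev > delta:
            runs.append([cur])
        else:
            runs[-1].append(cur)
    return runs

def levels_and_outliers(values, delta, min_size=3):
    """Return (centers, outliers).

    Assumption: a run with fewer than min_size samples is an outlier group,
    and the center of a level is the median of its run.
    """
    centers, outliers = [], []
    for run in split_by_gap(values, delta):
        if len(run) >= min_size:
            centers.append(median(run))
        else:
            outliers.extend(run)
    return centers, outliers

data = [199, 200, 205, 209, 217, 224, 239, 498, 573, 583,
        583, 590, 591, 594, 703, 710, 711, 717, 719, 721,
        836, 840, 845, 849, 855, 855, 856, 857, 858, 858,
        928, 935, 936, 936, 942, 943, 964, 977]

centers, outliers = levels_and_outliers(data, delta=15)
print(centers)   # [209, 586.5, 714.0, 855.0, 936.0]
print(outliers)  # [498, 964, 977]
```

With Delta = 15 and `min_size=3`, the sample data yields five levels, and 498, 964 and 977 are flagged as outliers. Both parameters would still need tuning for other sensors.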

WorkaroundNewbie
  • How can you define Delta? As you can see, the gaps between levels vary randomly from one level to the next. I need something like a "dynamic" Delta. Or, better, a clustering algorithm that classifies the data based on density. It is clear that the data belonging to one level are very close to each other, so if some kind of clustering can detect this kind of distribution, I think it will solve the problem. – asker Mar 23 '18 at 06:42
  • @asker Then do some kind of preprocessing in which you find out something like the average distance between two neighbouring points and the maximal distance and then choose something in between. I mean, I don't think one can state this clearer without knowing anything about the data beforehand. – Aziuth Apr 10 '18 at 11:05
2

This is an extension of the answer of @KarthikeyanMV. +1. Yes, you need to be able to determine a value for Delta. Here is a process that will do that. I am writing my code in R, but I think that the process will be clear.

Presumably, the gaps between groups are bigger than the gaps within any group, so just look at the difference between successive points and ask where the big gaps are. Since you believe that there should be 5 groups, there should be 4 big gaps, so look at the 4th biggest difference.

## Your data
dat = c(199, 200, 205, 209, 217, 224, 239, 498, 573, 583, 
    583, 590, 591, 594, 703, 710, 711, 717, 719, 721, 
    836, 840, 845, 849, 855, 855, 856, 857, 858, 858, 
    928, 935, 936, 936, 942, 943, 964, 977)
(Delta = sort(diff(dat), decreasing=TRUE)[4])
[1] 75

This looks like Delta should be 75, but we failed to account for the outliers. Are there any points that are more than Delta from both the next point above and below? Yes.

BigGaps = diff(dat) >= Delta
(Outliers = which(c(BigGaps, T) & c(T, BigGaps)))
[1] 8

Point 8 is too far away to belong to either the group above or below. So let's remove it and try again.

dat = dat[-Outliers]
(Delta = sort(diff(dat), decreasing=TRUE)[4])
[1] 70
BigGaps = diff(dat) >= Delta
(Outliers = which(c(BigGaps, T) & c(T, BigGaps)))
integer(0)

After we remove point 8 the new Delta is 70. We check for outliers using the new Delta (70) and find none. So let's cluster using Delta = 70.

Cluster = cumsum(c(1, diff(dat)>=Delta))
plot(dat, pch=20, col=Cluster+1)

Clustered data

This mostly found the clusters that you want except that it included the last two points in the highest cluster rather than declaring them to be outliers. I do not see why they should be outliers instead of part of this group. Maybe you could elaborate on why you think that they should not be included.
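For readers not using R, the same procedure can be sketched in Python. This is an illustrative port under assumptions, not the answer's own code: the function name is hypothetical, and the manual "remove and try again" step is turned into a loop that repeats until no isolated point remains. It assumes at least two levels and more samples than levels.

```python
def cluster_by_largest_gaps(values, n_levels):
    """Pick Delta as the (n_levels - 1)-th largest gap, drop isolated points, repeat."""
    values = sorted(values)
    outliers = []
    while True:
        gaps = [b - a for a, b in zip(values, values[1:])]
        delta = sorted(gaps, reverse=True)[n_levels - 2]
        big = [g >= delta for g in gaps]
        before = [True] + big   # is there a big gap before point i? (edges count as big)
        after = big + [True]    # is there a big gap after point i?
        isolated = {i for i in range(len(values)) if before[i] and after[i]}
        if not isolated:
            break
        outliers += [values[i] for i in sorted(isolated)]
        values = [v for i, v in enumerate(values) if i not in isolated]

    # Start a new cluster after every big gap.
    clusters = [[values[0]]]
    for v, is_big in zip(values[1:], big):
        if is_big:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return delta, clusters, outliers

data = [199, 200, 205, 209, 217, 224, 239, 498, 573, 583,
        583, 590, 591, 594, 703, 710, 711, 717, 719, 721,
        836, 840, 845, 849, 855, 855, 856, 857, 858, 858,
        928, 935, 936, 936, 942, 943, 964, 977]

delta, clusters, outliers = cluster_by_largest_gaps(data, n_levels=5)
print(delta)     # 70
print(outliers)  # [498]
```

As in the R run, 964 and 977 stay in the highest cluster, because they are closer than Delta to their neighbors.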

I hope that this helps.

G5W
0

I'd suggest DBSCAN instead of K-Means.

It is a density based clustering algorithm that groups data points that are in the same proximity as each other without having to define an initial k or centroids like K-Means.

In DBSCAN, the neighborhood distance and the minimum number of neighbors are user-defined. If you know that the index has a consistent interval, DBSCAN might be a suitable way to solve your problem.
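To make the two parameters concrete, here is a minimal pure-Python sketch of DBSCAN specialized to 1-D data. The function name and the values `eps=15` and `min_samples=3` are assumptions chosen for the sample data, not recommendations. In practice a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` (parameters `eps` and `min_samples`) would be the usual choice.

```python
def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN for 1-D data.

    Returns (sorted_values, labels); label -1 marks noise (outliers).
    """
    pts = sorted(values)
    n = len(pts)
    # Each neighborhood includes the point itself.
    neighbors = [[j for j in range(n) if abs(pts[i] - pts[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:          # only core points keep expanding
                        stack.append(q)
        cluster += 1
    return pts, labels

data = [199, 200, 205, 209, 217, 224, 239, 498, 573, 583,
        583, 590, 591, 594, 703, 710, 711, 717, 719, 721,
        836, 840, 845, 849, 855, 855, 856, 857, 858, 858,
        928, 935, 936, 936, 942, 943, 964, 977]

pts, labels = dbscan_1d(data, eps=15, min_samples=3)
noise = [p for p, lab in zip(pts, labels) if lab == -1]
print(noise)            # [498, 964, 977]
print(max(labels) + 1)  # 5
```

With these settings the sample data falls into five clusters, with 498, 964 and 977 labeled as noise. The center of each level could then be taken as the mean or median of its cluster.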

0

I notice that those levels look somewhat like lines. You could do something like this:

1. sort the points
2. take the first two unprocessed points into an ordered set called the current line
3. lay a line between the first and last point of the set
4. test whether the line through the first point and the next unprocessed point
    makes an angle below some threshold with that line
5. If yes, add the point and go to 3
6. If no, store the current line somewhere and start again at 2

You could also start by checking whether the line through the first two points of such a set makes an angle with the x-axis that is above another threshold and, if so, store the first point as a singleton. Those singletons are the outliers.

Another version would be to go only by the angle of the connection of two points to the x-axis. On a level change, there will be a far bigger angle (incline, slope) than between two points on a level.
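The simpler, slope-only version can be sketched as follows. This is a hedged illustration: the function name, the horizontal spacing `x_step` between consecutive samples, and the 60 degree threshold are all assumptions. Note that with equally spaced samples, an angle threshold θ is equivalent to a gap threshold of `x_step * tan(θ)`, so the result depends on how the x-axis is scaled.

```python
import math

def split_by_angle(values, max_angle_deg, x_step=1.0):
    """Start a new level whenever the segment between two consecutive
    sorted samples is steeper than max_angle_deg relative to the x-axis.

    x_step (assumed) is the horizontal distance between consecutive samples.
    """
    values = sorted(values)
    levels = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        angle = math.degrees(math.atan2(cur - prev, x_step))
        if angle > max_angle_deg:
            levels.append([cur])
        else:
            levels[-1].append(cur)
    return levels

data = [199, 200, 205, 209, 217, 224, 239, 498, 573, 583,
        583, 590, 591, 594, 703, 710, 711, 717, 719, 721,
        836, 840, 845, 849, 855, 855, 856, 857, 858, 858,
        928, 935, 936, 936, 942, 943, 964, 977]

levels = split_by_angle(data, max_angle_deg=60, x_step=10)
print([len(level) for level in levels])  # [7, 1, 6, 6, 10, 6, 2]
```

Runs of one or two samples (here 498, and the pair 964 and 977) are candidates for outliers, while the longer runs correspond to the five levels.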

Aziuth