1

I have a list of coordinates, coord, which looks like this when plotted:

this

I want to remove the long string of points that goes completely from 0 to 1 from the data set, shown on this plot starting at (0, 11) and ending at (1, 11) and the other one that begins at (0, 24) and ends at (1, 28).

So far, I have tried using kmeans to group the data by height using this code:

jet = colormap('jet');

amount = 20;
step = floor(numel(jet(:,1))/amount);
idxOIarr = cell(numel(terp));
scale = 100;

for ii = 1:numel(terp)
    figure;
    hold on;
    expandDat = [stretched{ii}(:,1), scale.*log(terp{ii}(:,2))];
    [idx, cent] = kmeans(expandDat(:,1:2), amount, 'Distance', 'cityblock');
    idxOIarr{ii} = idx;
    for jj = 1:amount
        scatter(stretched{ii}(idx == jj,1), FREQ(terp{ii}(idx == jj,2)), 10, jet(step*jj,:), 'filled');
    end
end

resulting in this image: this image Although it does separate the higher rows quite well, it breaks the line in the middle in two and groups the line that begins at (0,20) with some data points below it.

Is there any other way to group and remove these points?

kmarx
  • 15
  • 6
  • 1
    You want to group points based on their distances. Use an R-tree to find the nearest neighbors to each point. This will give you a graph, where each two points you consider "connected" are joined by an edge. Next, find the connected components in the graph that stretch from one side to the other. – Cris Luengo Aug 01 '18 at 18:28
  • 1
    Hi @CrisLuengo, thanks for the reply. If you have the time, can you make an answer where you explain this? I never heard of R-trees before and am unsure of how to implement your suggestion. – kmarx Aug 01 '18 at 18:39

1 Answers1

0

The most efficient way to solve this involves building a graph where each point is a vertex. You join points that you consider "connected" or "closed" with an edge. Thus, the graph will connected components. Now you need to look for the connected components that span the whole range from 0 to 1.

  1. Build the graph. Finding neighbors is most efficient using an R-tree. Here are some suggestions. You can also use a k-d tree, for example. However, this is not strictly necessary, it just can get really slow without a proper spatial indexing structure, because you'll have to compare distances between each pair of points.

    Given a Nx2 matrix coord, you can find the square distances between each pair:

    D = sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2,3);
    

    (note again that this is expensive if N is large, and in that case using an R-tree will speed things up significantly). D(i,j) is the distance between points with indices i and j (i.e. coord(i,:) and coord(j,:).

    Next, build the graph, G, nodes i and j are connected if G(i,j)==1. G is a symmetric matrix:

    G = D <= max_distance;
    
  2. Find connected components. A connected component is just a set of nodes that you can reach from each other by following edges. You don't really need to find all connected components, you just need to find the set of points that have x=0, and starting from each, recursively visit all elements in its connected component to see if you can reach a point that has x=1.

    This next code is not tested, but helpfully it gives a starting point:

    start_indices = find(coord(:,1)==0);  % Is exact equality appropriate here?
    end_indices = find(coord(:,1)==1);
    to_remove = [];
    visited = false(size(coord,1), 1);
    for ii=start_indices.'
       % For each point with x=0, see if we can reach any of the points at x=1
       [res, visited] = can_reach(ii, end_indices, G, visited);
       if res
          % For this point we can, remove it!
          to_remove(end+1) = ii;
       end
    end
    
    % Iterative function to visit all nodes in a connected component
    function [res, visited] = can_reach(start, end_indices, G, visited)
       visited(start) = true;
       if any(start==end_indices)
          % We've reach an end point, stop iterating and return true.
          res = true;
          return;
       end
       next = find(G(start,:));  % find neighbors
       next(visited(next)) = []; % remove visited neighbors
       for ii=next
          [res, visited] = can_reach(ii, end_indices, G, visited);
          if res
             % Yes, we can visit an end point, stop iterating now.
             return
          end
       end
    end
    
Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
  • Thanks for the answer. I'm having some trouble understanding how I go from my x-y coordinates to nodes on a graph. How do I find `i` and `j`? – kmarx Aug 01 '18 at 19:28
  • OK, I believe I understand now; the node number is index of the coordinate. Please correct me if this is incorrect. – kmarx Aug 01 '18 at 19:46
  • @kmarx: indeed, the point at `coord(i,:)` is node `i` in the graph. Are you OK with the iterative algorithm you'll need to traverse each connected component? – Cris Luengo Aug 01 '18 at 19:56
  • I tried implementing it, but I was having some troubles. I would appreciate it if you provided a more detailed overview for step 2 of your solution. – kmarx Aug 02 '18 at 13:17
  • @kmarx: more of an outline for the code, I haven't tried running it. I hope it's a good starting point for you. – Cris Luengo Aug 02 '18 at 19:53