9

I have two sequences of lengths n and m. Each is a sequence of points of the form (x,y) and represents a curve in an image. I need to find how different (or similar) these sequences are, given that:

  1. one sequence is likely longer than the other (i.e., one can be half or a quarter as long as the other, but if they trace approximately the same curve, they are the same)
  2. these sequences could be in opposite directions (i.e., sequence 1 goes from left to right, while sequence 2 goes from right to left)

I looked into some difference estimates such as Levenshtein distance, as well as the edit distances used in structural similarity matching for protein folding, but none of them seems to do the trick. I could write my own brute-force method, but I want to know if there is a better way.

Thanks.

WanderingPhd
  • Can you be a bit more specific about what you want out of your edit distance? Are you trying to come up with some sort of metric space over the sequences? Are you just trying to rank matches? Or are you just looking for a subjective measure of similarity? – templatetypedef Jun 20 '11 at 21:59
  • @templatetypedef - What I really want to do is classify sequences (routes) into groups. I was hoping to do this by calculating a similarity or difference value and use a pre-defined threshold value to assign that sequence to a group. – WanderingPhd Jun 20 '11 at 22:05
  • When you say that one sequence is shorter than the other, do you mean that it follows only a portion of the other curve, or that it follows the same curve but has fewer points? – Beta Jun 20 '11 at 22:53
  • @Beta - It is possible for it to be both. These sequences are locations of objects traversing the space and the incoming data is error-prone - it could start from the middle (only a portion of the curve) or there could be missing information (fewer points). – WanderingPhd Jun 21 '11 at 01:07
  • Is there any more info about the curves? Could they self-intersect, for example? How well could a Catmull-Rom spline approximate the curves? etc ... – Dr. belisarius Jun 21 '11 at 09:47
  • @belisarius - Yes, they could self-intersect. I'm not familiar with Catmull-Rom but from what I read online, I think it would. – WanderingPhd Jun 21 '11 at 14:56

5 Answers

3

Do you mean that you are trying to match curves that have been translated in x,y coordinates? One technique from image processing is to use chain codes [I'm looking for a decent reference, but all I can find right now is this] to encode each sequence and then compare those chain codes. You could take the sum of the differences (modulo 8) and if the result is 0, the curves are identical. Since the sequences are of different lengths and don't necessarily start at the same relative location, you would have to shift one sequence and do this again and again, but you only have to create the chain codes once. The only way to detect if one of the sequences is reversed is to try both the forward and reverse of one of the sequences. If the curves aren't exactly alike, the sum will be greater than zero but it is not straightforward to tell how different the curves are simply from the sum.
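The chain-code comparison described above could be sketched roughly as follows. This is only an illustration under some assumptions that are not in the answer: steps between arbitrary (x,y) points are quantized to the nearest of 8 directions, the "difference" between two codes is taken as the circular distance modulo 8, and the function names (`chain_code`, `best_match`, `chain_distance`) are made up for this example. It also assumes both curves are sampled at a similar step size.

```python
import math

def chain_code(points):
    """8-direction code per step; each step angle is rounded to the nearest 45 degrees."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)
        codes.append(round(angle / (math.pi / 4)) % 8)
    return codes

def code_difference(a, b):
    """Sum of circular (modulo 8) differences between two equal-length code lists."""
    return sum(min((x - y) % 8, (y - x) % 8) for x, y in zip(a, b))

def best_match(short, long):
    """Slide the shorter code along the longer one; keep the smallest difference sum."""
    return min(
        code_difference(short, long[s:s + len(short)])
        for s in range(len(long) - len(short) + 1)
    )

def chain_distance(points_a, points_b):
    """Smallest difference over all shifts, trying points_a forward and reversed."""
    b = chain_code(points_b)
    candidates = []
    for a in (chain_code(points_a), chain_code(points_a[::-1])):
        short, long = (a, b) if len(a) <= len(b) else (b, a)
        candidates.append(best_match(short, long))
    return min(candidates)

pts = [(0, 0), (1, 0), (2, 0), (2, 1)]
print(chain_code(pts))
print(chain_distance(pts, pts[::-1]))
```

Because only step directions are encoded, a curve and a shifted copy of it get a distance of 0, so this sketch ignores absolute position.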

This method will not be rotationally invariant. If you need a method that is rotationally invariant, you should look at Boundary-Centered Polar Encoding. I can't find a free reference for that, but if you need me to describe it, let me know.

Luke Postema
  • It is important to know how different a curve is from another since that's how I'm classifying curves though it looks like the sum would give me a rough idea if the curve classes are sufficiently different. Also, the absolute coordinates of the curves are important not just their spatiality (two curves that look the same but are in different parts of the image are different) and it seems from my reading that chain codes are translationally invariant? If so, it would not help. – WanderingPhd Jun 21 '11 at 14:04
2

A method along these lines might work:

For both sequences:

Fit a curve through the sequence. Make sure that you have a continuous one-to-one function from [0,1] to points on this curve. That is, for each (real) number between 0 and 1, this function returns a point on the curve belonging to it. By tracing the function for all numbers from 0 to 1, you get the entire curve.

One way to fit a curve would be to draw a straight line between each pair of consecutive points (it is not a nice curve, because it has sharp bends, but it might be fine for your purpose). In that case, the function can be obtained by calculating the total length of all the line segments (Pythagoras). The point on the curve corresponding to a number Y (between 0 and 1) corresponds to the point on the curve that has a distance Y * (total length of all line segments) from the first point on the sequence, measured by traveling over the line segments (!!).
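A minimal sketch of that piecewise-linear function, assuming the sequence is a list of (x, y) tuples with at least two points. The name `make_curve` and its layout are illustrative choices, not part of the answer.

```python
import bisect
import math

def make_curve(points):
    """Return F: [0, 1] -> (x, y), tracing the polyline through points by arc length."""
    # cumulative[i] is the distance travelled along the segments up to points[i]
    cumulative = [0.0]
    for p, q in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(p, q))
    total = cumulative[-1]

    def F(t):
        # Distance from the first point, with t clamped to [0, 1]
        target = min(max(t, 0.0), 1.0) * total
        # Index of the segment that contains the target distance
        i = min(bisect.bisect_right(cumulative, target) - 1, len(points) - 2)
        segment_length = cumulative[i + 1] - cumulative[i]
        frac = (target - cumulative[i]) / segment_length if segment_length > 0 else 0.0
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        return (x0 + frac * (x1 - x0), y0 + frac * (y1 - y0))

    return F

F = make_curve([(0, 0), (2, 0), (2, 2)])
print(F(0.25), F(0.75))
```

Since the function is driven by distance along the curve rather than by point index, two sequences that trace the same path with different numbers of points give nearly the same F.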

Now, after we have obtained such a function F(double) for the first sequence, and G(double) for the second sequence, we can calculate the similarity as follows:

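int steps = 100;   // number of sample intervals along each curve
double curveDistanceSquared = 0.0;
for (int i = 0; i <= steps; i++)
{
   // an integer counter avoids the rounding drift of repeatedly adding 0.01
   double d = (double)i / steps;
   Point pointOnCurve1 = F(d);
   Point pointOnCurve2 = G(d);
   // alternatively, use G(1.0 - d) to check whether the second sequence is reversed
   double distanceOfPoints = pointOnCurve1.EuclideanDistance(pointOnCurve2);
   curveDistanceSquared = curveDistanceSquared + distanceOfPoints * distanceOfPoints;
}
// identical curves give curveDistanceSquared == 0, so guard against dividing by zero
// (or simply compare curveDistanceSquared against a threshold instead)
double similarity = 1.0 / curveDistanceSquared;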
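For comparison, here is a hedged Python sketch of the same loop that tries both directions of the second curve and returns the smaller result. It assumes F and G are callables mapping a number in [0, 1] to an (x, y) tuple. The name `curve_distance_squared` and the division by the sample count (so the value does not depend on `steps`) are additions for this example.

```python
def curve_distance_squared(F, G, steps=100):
    """Mean squared distance between F and G, taking the better of G forward and G reversed."""
    forward = 0.0
    backward = 0.0
    for i in range(steps + 1):
        t = i / steps
        fx, fy = F(t)
        gx, gy = G(t)        # second curve traversed in the same direction
        rx, ry = G(1.0 - t)  # second curve traversed in the opposite direction
        forward += (fx - gx) ** 2 + (fy - gy) ** 2
        backward += (fx - rx) ** 2 + (fy - ry) ** 2
    return min(forward, backward) / (steps + 1)

# Two horizontal segments traced in opposite directions, and one shifted up by 1
left_to_right = lambda t: (t, 0.0)
right_to_left = lambda t: (1.0 - t, 0.0)
shifted = lambda t: (t, 1.0)
print(curve_distance_squared(left_to_right, right_to_left))
print(curve_distance_squared(left_to_right, shifted))
```

Returning the distance itself rather than its reciprocal avoids the division by zero for identical curves, and a threshold on it could be used for grouping.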

Possible improvements:

-Find an improved way to fit the curves. Note that you still need the function that traces the curve for the above method to work.

-When calculating the distance, consider reparametrizing the function G in such a way that the distance is minimized. This means you have an increasing function R, such that R(0) = 0 and R(1) = 1, but which is otherwise general. When calculating the distance you use

   Point pointOnCurve1 = F(d);
   Point pointOnCurve2 = G(R(d));

Subsequently, you try to choose R in such a way that the distance is minimized. (To see why this is valid, note that G(R(d)) also traces the entire curve.)
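One discrete way to search for such a reparametrization is dynamic time warping (DTW). To be clear, this is a different, standard technique swapped in for illustration, not something the answer names: it works on the sampled points directly, allows R to be non-decreasing rather than strictly increasing, and sums plain distances instead of squared ones. The function name `dtw_distance` is an assumption of this sketch.

```python
import math

def dtw_distance(a, b):
    """Sum of point distances under the best monotone alignment of sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Advance along a, along b, or along both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

a = [(0, 0), (2, 0)]
b = [(0, 0), (0, 0), (2, 0)]  # same path with a repeated sample
print(dtw_distance(a, b))
print(min(dtw_distance(a, b), dtw_distance(a, b[::-1])))  # also try b reversed
```

The table is O(n*m) in time and memory, which may matter for long sequences.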

willem
  • Thanks, I think this will work. The only difference I would make to it is to restrict the size of the larger curve to the smaller one and that should take care of situations where one curve is a subseq of the other. I will try it out and report back. Is there a name for this technique? It looks similar to [SSAP](http://en.wikipedia.org/wiki/Structural_alignment#SSAP) in protein folding – WanderingPhd Jun 21 '11 at 14:39
  • From your question, I understood that the two curves were equally long when measured in the two-dimensional space, but that the number of points for the two curves differs. Now I understand that the length of the curves can be different, and the number of points is also different. You can accommodate this case in a similar way to my remarks on reparametrization. I don't know the name of this technique. – willem Jun 21 '11 at 14:51
1

Why not do some sort of curve-fitting procedure (least squares, whether ordinary or non-linear) and see if the coefficients on the shape parameters are the same? If you run it as a panel-data sort of model, there are explicit statistical tests of whether sets of parameters are significantly different from one another. That would solve the problem of the same curve sampled at different resolutions.
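As a very reduced illustration of the idea, the sketch below fits an ordinary least-squares straight line y = intercept + slope * x to each sequence and compares the coefficients. The assumptions are strong and not from the answer: the curve must be describable as y in terms of x (so no self-intersection), a real curve would need a richer model, and no statistical test is performed. `fit_line` and `coefficient_gap` are hypothetical names.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return intercept, slope

def coefficient_gap(points_a, points_b):
    """Largest absolute difference between the fitted coefficients of the two sequences."""
    return max(abs(p - q) for p, q in zip(fit_line(points_a), fit_line(points_b)))

dense = [(x, 2 * x + 1) for x in range(9)]    # 9 samples of y = 2x + 1
sparse = [(0, 1), (4, 9), (8, 17)]            # 3 samples of the same line
shifted = [(x, y + 5) for x, y in dense]      # same shape, moved up by 5
print(coefficient_gap(dense, sparse))
print(coefficient_gap(dense, shifted))
```

The intercept term is what keeps the comparison sensitive to absolute position: the shifted copy has the same slope but a different intercept.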

Samsdram
  • That could work. However, it should not be translation invariant since the absolute locations of the points on the curve matter. Is that possible as well? – WanderingPhd Jun 21 '11 at 14:43
1

Step 1: Canonicalize the orientation. For example, let's say that all curves start at the endpoint with the lowest lexicographic order.

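def inCanonicalOrientation(path):
    # points are (x, y) tuples, so < and <= compare them lexicographically;
    # slicing returns a reversed copy (reversed() would return an iterator without len())
    return path if path[0] <= path[-1] else path[::-1]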

Step 2: You can either be roughly accurate, or very accurate. If you wish to be very accurate, calculate a spline, or fit both curves to a polynomial of appropriate degree, and compare coefficients. If you'd like just a rough estimate, do as follows:

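import math

def resample(path, numPoints):
    """Yield numPoints (x, y) points evenly spaced by arc length along path (numPoints >= 2)."""
    # length of each straight segment between consecutive points
    segmentLengths = [math.dist(p, q) for p, q in zip(path, path[1:])]
    totalLength = sum(segmentLengths)

    segmentIndex = 0
    lengthBefore = 0.0  # total length of the segments before the current one

    for i in range(numPoints):
        samplePosition = i / (numPoints - 1) * totalLength
        # advance to the segment that contains samplePosition
        while (segmentIndex < len(segmentLengths) - 1
               and samplePosition > lengthBefore + segmentLengths[segmentIndex]):
            lengthBefore += segmentLengths[segmentIndex]
            segmentIndex += 1
        segmentLength = segmentLengths[segmentIndex]
        # fraction of the current segment covered, clamped against rounding error
        howFar = min(1.0, (samplePosition - lengthBefore) / segmentLength) if segmentLength > 0 else 0.0
        (x0, y0), (x1, y1) = path[segmentIndex], path[segmentIndex + 1]
        yield ((1 - howFar) * x0 + howFar * x1, (1 - howFar) * y0 + howFar * y1)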

This can be modified from a linear resampling to something better.

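def error(pathA, pathB):
    pathA = inCanonicalOrientation(pathA)
    pathB = inCanonicalOrientation(pathB)

    # resample both paths to the point count of the denser one
    higherResolution = max(len(pathA), len(pathB))
    resampledA = list(resample(pathA, higherResolution))
    resampledB = list(resample(pathB, higherResolution))

    # sum of Euclidean distances between corresponding (x, y) points
    totalError = sum(
        ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        for (ax, ay), (bx, by) in zip(resampledA, resampledB)
    )
    averageError = totalError / higherResolution
    normalizedError = averageError / Z(pathA)  # Z(pathB) would serve equally well
    return normalizedError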

Where Z is something like the "diameter" of your path, perhaps the maximum Euclidean distance between any two points in a path.
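One possible reading of that "diameter", sketched with a brute-force search over all pairs of points. It assumes the path is a list of at least two (x, y) tuples; the quadratic cost is a simplification for this example, not something the answer prescribes.

```python
import math
from itertools import combinations

def Z(path):
    """Largest Euclidean distance between any two points of the path (O(n^2) pairs)."""
    return max(math.dist(p, q) for p, q in combinations(path, 2))

print(Z([(0, 0), (3, 4), (1, 1)]))
```

For long paths, a cheaper stand-in such as the diagonal of the bounding box might be good enough as a normalizer.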

ninjagecko
  • Thanks for the detailed reply. I haven't had a chance to check it out since I went with an algorithm someone else posted but when I get some time, I'll try yours out as well. – WanderingPhd Jun 25 '11 at 12:03
  • @WanderingPhd: this answer is basically the same as the accepted answer, sketching out how to do so with line segments (can modify to use splines). – ninjagecko Jun 25 '11 at 14:37
0

I would use a curve-fitting procedure, but also throw in a constant term, i.e. 0 = B0 + B1*X + B2*Y + B3*X*Y + B4*X^2, etc. This would catch the translational variance and then you can do a statistical comparison of the estimated coefficients of the curves formed by the two sets of points as a way of classifying them. I'm assuming you'll have to do bivariate interpolation if the data form arbitrary curves in the x-y plane.

Marty B