I've been searching but haven't found how to do this in Refine.

I've got two columns of unique IDs. For each a in A, I want to find the top 10 closest matches in B.

My backup plan is to just iterate using Levenshtein distance ... but Refine has such a nice interface and many more algorithms implemented that I was hoping to do some of the work using it.

Or is there another tool for doing this?

mathtick
  • What is the definition of "closest match"? Are the IDs numeric? If there's a way to cluster the IDs, you could split the columns into two projects and use the cross() function to do the lookup on a cluster ID. – Tom Morris Mar 21 '13 at 21:12
  • I should have been more specific. The IDs are text fields, with a lot of bad abbreviations on one side. After some investigation it looks like the matches just won't work very well without extra data for this particular set. I'll have a look at "cross()" ... I did not know about that functionality. – mathtick Mar 22 '13 at 01:39

1 Answer

Did you know you can use clustering algorithms like fingerprint or ngramFingerprint (source) outside of the clustering interface in Refine?

Using your IDs field, create a new column based on this column with the following expression: ngramFingerprint(value)

Build the same fingerprint column in your other data set, and you can then use cross() to join the two on it. This may help you get more matches.
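
To make the idea concrete outside of Refine, here is a minimal Python sketch of the same approach. It only approximates Refine's n-gram fingerprint keyer (lowercase, strip punctuation and whitespace, collect the unique n-grams, sort and join them) and is not its exact implementation. The names ngram_fingerprint and cross_on_fingerprint are made up for this illustration, and n=2 is an arbitrary choice.

```python
import re
import unicodedata
from collections import defaultdict


def ngram_fingerprint(text, n=2):
    """Approximate n-gram fingerprint key (a sketch, not Refine's exact code)."""
    # Fold accents to plain ASCII and lowercase the text.
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop everything except letters and digits (punctuation, whitespace).
    text = re.sub(r"[^a-z0-9]", "", text)
    # Collect the unique n-grams, then sort and join them into one key.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cross_on_fingerprint(ids_a, ids_b, n=2):
    """For each ID in ids_a, list the IDs in ids_b that share its key."""
    index = defaultdict(list)
    for b in ids_b:
        index[ngram_fingerprint(b, n)].append(b)
    return {a: index.get(ngram_fingerprint(a, n), []) for a in ids_a}


if __name__ == "__main__":
    print(cross_on_fingerprint(["Acme Corp."], ["ACME-CORP", "Zeta Ltd"]))
```

Note that this only pairs IDs whose keys are identical; it does not rank the ten closest candidates, so heavily abbreviated IDs may still need a distance-based pass.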

magdmartin