google refine: use facet tools to infer map between two columns

Question

I've been searching but haven't found how to do this in refine.

I've got two columns of unique IDs. For each a in A, I want to find the top 10 closest matches in B.

My backup plan is to just use Levenshtein to iterate ... but Refine has such a nice interface and many more algorithms implemented that I was hoping to be able to do some of the work using it.

Or is there another tool for doing this?
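
For reference, the Levenshtein backup plan could be sketched in Python roughly as follows. This is a minimal illustration, not part of Refine: the names `levenshtein` and `top_matches` and the sample IDs are made up for the example, and the brute-force scan compares every a against every b, so it only suits columns of modest size.

```python
import heapq


def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete from s
                               current[j - 1] + 1,       # insert into s
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def top_matches(a, candidates, k=10):
    """Return the k candidates with the smallest edit distance to a."""
    return heapq.nsmallest(k, candidates, key=lambda b: levenshtein(a, b))


if __name__ == "__main__":
    # Hypothetical sample data standing in for columns A and B.
    column_b = ["Acme Corp.", "Zeta Ltd", "Acme Co"]
    print(top_matches("Acme Corp", column_b, k=2))
```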

What is the definition of "closest match"? Are the IDs numeric? If there's a way to cluster the IDs, you could split the columns into two projects and use the cross() function to do the lookup on a cluster ID. — Tom Morris, Mar 21 '13 at 21:12
I should have been more specific. The IDs are text fields, with a lot of bad abbreviations on one side. After some investigation it looks like the matches just won't work very well without extra data for this particular set. I'll have a look at "cross()" ... I did not know about that functionality. — mathtick, Mar 22 '13 at 01:39

score 1 · Accepted Answer · answered Apr 06 '13 at 18:06

Did you know you can use a clustering algorithm such as fingerprint or ngramFingerprint (source) outside of the clustering interface in Refine?

Using your IDs field, create a new column based on it with the following expression: ngramFingerprint(value)

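You can now use cross() against your other data set on this new column, provided that data set has a key column built with the same expression. Because cross() matches on equal values, IDs that differ only in case, punctuation or spacing will now line up, which might help to get more matches.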
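
To make the idea concrete outside Refine, here is a rough Python sketch of keying both columns on an n-gram fingerprint and then joining on that key. It only approximates what Refine does (lowercase, strip punctuation and whitespace, collect the sorted unique n-grams); the function names, the default `n=2` and the sample IDs are assumptions for illustration, not Refine's actual implementation.

```python
import re
import unicodedata
from collections import defaultdict


def ngram_fingerprint(s, n=2):
    """Approximate n-gram fingerprint: sorted, de-duplicated n-grams of the cleaned string."""
    # Fold accented characters to plain ASCII where possible.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Lowercase and drop everything that is not a letter or digit.
    s = re.sub(r"[^a-z0-9]", "", s.lower())
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))


def cross_on_key(column_a, column_b, n=2):
    """For each ID in column_a, list the IDs in column_b sharing its fingerprint."""
    index = defaultdict(list)
    for b in column_b:
        index[ngram_fingerprint(b, n)].append(b)
    return {a: index.get(ngram_fingerprint(a, n), []) for a in column_a}


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two ID columns.
    print(cross_on_key(["Acme Corp."], ["ACME-CORP", "Zeta Ltd"]))
```

Note that this only finds IDs whose keys are identical, so heavily abbreviated IDs will still be missed; it narrows the problem rather than ranking a top 10.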
