2

I have the following use case that OpenRefine seems to be a good candidate to solve. I have an existing, "dirty" product table in my database that looks like this:

id  name
51  Product A
52  product-a
53  product B
54  productb
55  produtc
56  productc

I have a new, "clean" product table that looks like this:

id  name
1   Product A
2   Product B
3   Product C

I'd like to use OpenRefine's clustering to generate a mapping file, to help me map products from the old table to the new table:

id  name        old_id
1   Product A    51
1   Product A    52
2   Product B    53
2   Product B    54
3   Product C    55
3   Product C    56

But I can't quite get OpenRefine to do what I want. Any advice for how to achieve this?

blindsnowmobile
  • 3,868
  • 6
  • 32
  • 47
  • How did you get from the "dirty product" table to the "clean product" table? My initial reaction is that OpenRefine could do this but it feels like it's a repeat of work you've already done if you've already generated the clean table? – Owen Stephens Jun 11 '19 at 08:16
  • 1
    The new table was created from a separate data source, not the original table. There will be overlap between the data, but not 100%. I want to use the fuzzy matching from OpenRefine to match the "old" rows to the "new" rows using the product name. There will be cases where old products won't map to anything in the new table, and that's fine. Those rows will either need to be mapped manually to something in the new table, included in the new table as a separate product, or removed entirely. I want to minimize the manual effort with that cleanup though. @OwenStephens – blindsnowmobile Jun 11 '19 at 16:32
  • I think @Frog23's answer covers the approach you'd need to take, although I've also left a comment on thad-guiry's answer for an alternative – Owen Stephens Jun 13 '19 at 09:01

2 Answers2

2

As it was already pointed out, there is no direct way to achieve this, but with the help support tables and the cross function, you can get the desired result:

  1. you take the column "name" from the dirty table and clean table, and combine them. Don't worry about the ids at this point.
  2. import them into OpenRefine (e.g. as project "product names")
  3. duplicate the column "name" (the only column so far) and name the new one "name_new".
  4. Cluster the column "name_new" and replace all of the old names with the correct new ones. Some manual adjustments might be required at this point. Your result should now look like this:

    name        name_new
    Product A   Product A
    product-a   Product A
    product B   Product B
    productb    Product B
    produtc     Product C
    productc    Product C
    Product A   Product A
    Product B   Product B
    Product C   Product C
    
  5. import the dirty table as "products" and the clean table as "products clean".
  6. in the project "products" transform the column "name" using

    value.cross("product names","name").cells["name_new"].value[0]

  7. rename the column "id" to "old_id"

  8. add a new column based on "name" using

    value.cross("products clean","name").cells["id"].value[0]

    and save it as "id". The table "products" has now the desired structure.

I hope this helps.

Frog23
  • 173
  • 1
  • 4
0

Clustering function is limited to a single column to find similar strings within that column.

OpenRefine doesn't yet have string similarity functions across 2 or more tables or projects (Fuzzy Joins) in the way that your use case presents. You would have to use other tools for this. A common tool that I've seen folks use and express satisfaction with Fuzzy Joining is MS PowerBI (Desktop is Free but has limits on Relationships and Exporting, but Pro version is only $10 a month and canceling anytime) but if you wanted something completely free then a few R packages do this, one of which is https://www.rdocumentation.org/packages/fuzzyjoin/versions/0.1.4

In OpenRefine, we totally want to allow Fuzzy Joins across Projects/datasets in the future and it's on our issue list, but we just haven't had the funding to implement this along with tons of other features we know users would like to see.

Thad Guidry
  • 579
  • 4
  • 8
  • An option for achieving a fuzzy lookup across two data sets with OpenRefine is to use the CSV Reconcile software http://okfnlabs.org/reconcile-csv/ to load your 'clean' data file and try reconciling against this file (using OpenRefine's built in Reconciliation facility). – Owen Stephens Jun 13 '19 at 08:59
  • Thanks for the update. I was able to combine the data into a single file. I used Frog23's approach, but reconcile looks interesting as well. – blindsnowmobile Jun 13 '19 at 17:24