0

As it seems, cross-column clustering isn't supported yet with OpenRefine.

Does anyone have any suggestions of how to cluster 'models' based on 'manufacturers', much like a 'city' would be based on a 'state' (many 'Springfield' could exist in the US, but only cluster "city": 'Springfield', if the relative 'state' column is the same)? The relative column is already normalized.

pnuts
  • 58,317
  • 11
  • 87
  • 139
c-griffin
  • 2,938
  • 1
  • 18
  • 25

2 Answers2

0

One easy way to do it would be to create a column which was the concatenation of the model+manufacturer, cluster on the joined fields, then (if needed) split the two pieces back apart again.

Tom Morris
  • 10,490
  • 32
  • 53
0

I had a similar requirement for de-duplicating address strings. So I created a new column (say COMPLETE_ADDRESS) and concatenated the STREET, CITY, PROVINCE, COUNTRY and ZIPCODE fields using the below GREL expression

cells["STREET"].value + " " + cells["CITY"].value + " " + cells["PROVINCE"].value + " " + cells["COUNTRY"].value + " " + cells["ZIPCODE"].value

Then I did the following :

  1. Clustered the new COMPLETE_ADDRESS column with the default algorithm
  2. Merged the values in each cluster (now the values are perfect duplicates)
  3. Sort the column permanently.
  4. Do a "blank down" operation.
  5. Finally pick only non-null values in the COMPLETE_ADDRESS

Having said that, as of this writing, there is no feature to merge the independent columns. The only way to do that it is to split the COMPLETE_ADDRESS into separate columns suitably. In this case, you will have to use a better separator such as pipe "|" symbol which will not conflict with existing values.

Thyag
  • 1,217
  • 13
  • 14