OpenRefine - Cross-column clustering

Question

As it seems, cross-column clustering isn't supported yet with OpenRefine.

Does anyone have any suggestions of how to cluster 'models' based on 'manufacturers', much like a 'city' would be based on a 'state' (many 'Springfield' could exist in the US, but only cluster "city": 'Springfield', if the relative 'state' column is the same)? The relative column is already normalized.

score 0 · Accepted Answer · answered Feb 26 '14 at 03:44

0

One easy way to do it would be to create a column which was the concatenation of the model+manufacturer, cluster on the joined fields, then (if needed) split the two pieces back apart again.

answered Feb 26 '14 at 03:44

Tom Morris

10,490
32
53

Thyag · Answer 2 · 2017-02-11T03:26:48.093

I had a similar requirement for de-duplicating address strings. So I created a new column (say COMPLETE_ADDRESS) and concatenated the STREET, CITY, PROVINCE, COUNTRY and ZIPCODE fields using the below GREL expression

cells["STREET"].value + " " + cells["CITY"].value + " " + cells["PROVINCE"].value + " " + cells["COUNTRY"].value + " " + cells["ZIPCODE"].value

Then I did the following :

Clustered the new COMPLETE_ADDRESS column with the default algorithm
Merged the values in each cluster (now the values are perfect duplicates)
Sort the column permanently.
Do a "blank down" operation.
Finally pick only non-null values in the COMPLETE_ADDRESS

Having said that, as of this writing, there is no feature to merge the independent columns. The only way to do that it is to split the COMPLETE_ADDRESS into separate columns suitably. In this case, you will have to use a better separator such as pipe "|" symbol which will not conflict with existing values.

OpenRefine - Cross-column clustering

2 Answers2