OpenRefine text transform unique() ignoring case

Question

Is there a way to tell uniques() to ignore case?

I have a GREL that runs like

forEach(value.split(","),v,v.trim()).uniques().join(",")

This takes each value in the cell seperated by commas, and then spits out the unique value/s in that cell. Works great however if I have a cell that contains Paul, PAUL it will return both rather than just 'Paul'.

Is it possible to cast all the values temporarily to uppercase to compare and then return the first spelling you had of a unique?

Can you give an example of a starting set of data and what you'd want as the output from it? — Owen Stephens, Oct 19 '16 at 16:40

score 1 · Accepted Answer · answered Oct 19 '16 at 17:28

Checking I understand the problem - for example if you start with:

Paul,PAUL,Edward,edward

Am I right to assume you want to end up with:

Paul,Edward

Assuming I've understood this, then I think the way I'd approach this is to create the unique values ignoring case, then look back to the original values and extract the first one that matches a particular key.

So something like: Duplicate the data into a new column called 'keys' using "Edit column->Add column based on this column" with the GREL transformation:

forEach(value.split(","),v,v.trim().toLowercase()).uniques().join(",")

Now you have:

| Col1                    | keys        |
|-------------------------|-------------|
| Paul,PAUL,Edward,edward | paul,edward |

You can now iterate through the values in the keys column, and find the first value in Col1 which would convert to that key using the same transformation:. To do this on the 'key' column you can use the transformation:

forEach(value.split(","),v,filter(cells["Col1"].value.split(","),w,w.trim().toLowercase()==v)[0]).join(",")

That should leave you with

| Col1                    | keys        |
|-------------------------|-------------|
| Paul,PAUL,Edward,edward | Paul,Edward |

Of course it is worth noting that if your original data is in a different order you'd get a different final value - e.g. "PAUL,Paul,Edward,edward" would end up with "PAUL,Edward". It's possible to do some more work to improve this, but there are going to be limits.

score 1 · Answer 2 · answered Oct 20 '16 at 03:08

You might also have the need, depending on your data, to flip the problem around by going into Records mode and doing a "Split multi-valued cells" instead of your long GREL join() expression...thereby creating extra Record rows of each value.

|Col1                       |Split      |
|Paul, PAUL, Edward, edward |Paul       |
|                           |PAUL       |
|                           |Edward     |
|                           |edward     |

From there you can operate on the Split column or create a new column based on uniques, other Grel expressions, etc. You can do a Fill Down on Col1 so that further operations don't lose their key record of 'Paul, Paul, Edward, edward'.

Remember that OpenRefine has great GREL operations, but a lot of the power comes from Record mode, Facets, and Row operations...not only Column and Cell operations. So don't limit yourself to just long complete Grel syntax. Try to break down the problems using all of OpenRefine's operations and modes.

For more info on 'Split multi-valued cells' and other operations, see our wiki section: https://github.com/OpenRefine/OpenRefine/wiki/Cell-Editing#splitting-multiple-values-within-cells-to-produce-records

Thanks the record mode has helped me out in other problem. – Paul M Oct 21 '16 at 15:55 — Paul M, Oct 21 '16 at 15:55

OpenRefine text transform unique() ignoring case

2 Answers2