Unicode regex expression to POSIX expression conversion to support in Redshift/ Postgresql

Question

I've been trying to convert a unicode regex to POSIX regex to remove \p{So} , \p{Cs}, \p{Cn} and \x1A type of characters from a column.

In Informatica I was using reg_replace (col_name,'[\p{So}\p{Cs}\p{Cn}\x1A]',' ') function to filter out these kind of characters and replace them with a single space.

However, when migrating to RedShift the same regex is not working with REGEXP_REPLACE() as it does not support the Unicode blocks.

For e.g. this is the string :-

'INT¡®L n°1 di KONGRESSE Ä Ê Í EUROREGIONE. V.Ž?#'

From above string, I need to keep the Latin / French letters, alphanumeric & punctuations and just remove the other symbols.

The sample looks like you have [mojibake](https://en.wikipedia.org/wiki/Mojibake) input. — tripleee, Jan 11 '22 at 09:56

Wiktor Stribiżew · Answer 1 · 2022-01-12T07:47:04.090

0

You can use

[^À-Ž[:punct:][:alnum:]]

It matches any single char other than letters from À to Ž, punctuation and alphanumerics. If [:alnum:] is Unicode-aware in your environment, you may remove À-Ž.

It also appears that you still can use a PCRE regex with Amazon Redshift REGEXP_REPLACE provided you pass the p parameters option:

REGEXP_REPLACE(col_name,'[\\p{So}\\p{Cs}\\p{Cn}\\x1A]+',' ', 1, 'p')

The backslashes need doubling, 1 param is the position in string to start matching from and the p option tells REGEXP_REPLACE to use PCRE regex syntax.

edited Jan 12 '22 at 07:47

answered Jan 11 '22 at 08:47

Wiktor Stribiżew

607,720
39
448
563

1

Hi Wiktor, Thanks for your answer. However, using this pattern also it mistakenly matched the alphabets 'So', 'Cs', 'Cn' and not being used as it's supposed to. – Pranay Purohit Jan 11 '22 at 13:50
@PranayPurohit Does `'[\x1A]+'` or `'[\\x1A]+'` remove the `¡` char? It is possible to create a specific character class if this works. It is also important to check if `'[\u001A]+'` / `'[\\u001A]+'` works – Wiktor Stribiżew Jan 11 '22 at 13:52
none of those are working not sure why even with double backslashes. Hence, came up with this pattern combination. `[^À-Ž[:punct:][:alnum:]]` – Pranay Purohit Jan 12 '22 at 06:57
@PranayPurohit Ok, maybe it is some versioning issue. I added the pattern with an explanation to the answer. – Wiktor Stribiżew Jan 12 '22 at 10:44

Unicode regex expression to POSIX expression conversion to support in Redshift/ Postgresql

1 Answers1