4

I would like to trim() a column and to replace any multiple white spaces and Unicode space separators to single space. The idea behind is to sanitize usernames, preventing 2 users having deceptive names foo bar (SPACE u+20) vs foo bar(NO-BREAK SPACE u+A0).

Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tab and carriage return, but it lack support for Unicode space separators.

I would have added to the regexp \h, but PostgreSQL doesn't support it (neither \p{Zs}):

SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence

We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.

I there a way to achieve something similar?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
KumZ
  • 565
  • 12
  • 20
  • 1
    (Assuming utf8.) Do you want to allow *any* characters encoded with 3 bytes or more? \u0020 is encoded with 2 bytes, \u00A0 is encoded with 3 bytes, the rest of the characters listed in your link are encoded with 4 characters. Unless you have Chinese characters (or similar) you might want to trim *all* characters encoded with 4 bytes .... – Erwin Brandstetter Aug 07 '20 at 13:49
  • 1
    Closely related: https://stackoverflow.com/a/22701212/939860 – Erwin Brandstetter Aug 07 '20 at 13:50

3 Answers3

4

Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:

'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'

So use:

SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));

Note: trim() comes after regexp_replace(), so it covers converted spaces.

It's important to include the basic space class \s (short for [[:space:]] to cover all current (and future) basic space characters.

We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.

Consider this demo:

SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
     , '\u' || lpad(to_hex(d), 4, '0') AS unicode
     , chr(d) ~ '\s' AS in_posix_space_class
     , chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM  (
   -- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
   -- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202) AS dec  -- UNICODE "Spaces"
   UNION ALL
   SELECT generate_series (8203, 8207) AS dec  -- First 5 space-like UNICODE "Format characters"
   ) t(d)
ORDER  BY d;
 decimal | hex  |  glyph   | unicode | in_posix_space_class | in_custom_class 
---------+------+----------+---------+----------------------+-----------------
       9 | 9    |          | \u0009  | t                    | t
      32 | 20   |          | \u0020  | t                    | t
     160 | a0   |          | \u00a0  | f                    | t
    5760 | 1680 |          | \u1680  | t                    | t
    6158 | 180e | ᠎        | \u180e  | f                    | t
    8192 | 2000 |          | \u2000  | t                    | t
    8193 | 2001 |          | \u2001  | t                    | t
    8194 | 2002 |          | \u2002  | t                    | t
    8195 | 2003 |          | \u2003  | t                    | t
    8196 | 2004 |          | \u2004  | t                    | t
    8197 | 2005 |          | \u2005  | t                    | t
    8198 | 2006 |          | \u2006  | t                    | t
    8199 | 2007 |          | \u2007  | f                    | t
    8200 | 2008 |          | \u2008  | t                    | t
    8201 | 2009 |          | \u2009  | t                    | t
    8202 | 200a |          | \u200a  | t                    | t
    8203 | 200b | ​        | \u200b  | f                    | t
    8204 | 200c | ‌        | \u200c  | f                    | t
    8205 | 200d | ‍        | \u200d  | f                    | t
    8206 | 200e | ‎        | \u200e  | f                    | t
    8207 | 200f | ‏        | \u200f  | f                    | t
    8239 | 202f |          | \u202f  | f                    | t
    8287 | 205f |          | \u205f  | t                    | t
    8288 | 2060 | ⁠        | \u2060  | f                    | t
   12288 | 3000 |         | \u3000  | t                    | t
   65279 | feff |         | \ufeff  | f                    | t
(26 rows)

Tool to generate the character class:

SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM  (
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202)
   UNION ALL
   SELECT generate_series (8203, 8207)
   ) t(d)
WHERE  chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]

db<>fiddle here

Related, with more explanation:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
2

You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:

REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')

It will replace all occurrences of one or more horizontal whitespaces (match by \h in other regex flavors supporting it) with a regular space char.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Seems working thanks. I've already tried something like than but only with `\u00A0` but written as `\uA0` which Postgres refused, that was my mistake! ;) – KumZ Aug 07 '20 at 14:29
  • 1
    @KumZ You may check all supported escape sequences (also called "character-entry escapes") in the [PostgreSQL regex documentation](https://www.postgresql.org/docs/current/functions-matching.html#POSIX-CHARACTER-ENTRY-ESCAPES-TABLE). – Wiktor Stribiżew Aug 07 '20 at 14:31
0

Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:

[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]

And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")

-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
    mytable
SET
    text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
    text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';

-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '%  %';

-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';