regex expression to include german characters

Question

I have a string that looks like this:

{"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}

I want to do a REGEXP_SUBSTR_ALL such that I can extract all "name" values into a list.

Something like this allows me to extract some names but not all. This is because some names include german characters like "Ü" which are not included with '\w+'

SELECT REGEXP_SUBSTR_ALL(ARRAY_TO_STRING(REGEXP_SUBSTR_ALL((ARRAY_TO_STRING(REGEXP_SUBSTR_ALL(ARTICLE_TAGS, '"name\\W+\\w+"'),',')), ':"\\w+'),','),'\\w+') FROM TABLE

For example, the expression above would give me this output:

[
    "simpleword"
]

while my desired output is this:

[
    "simpleword", "f23ÜÜ"
]

Felipe Hoffa · Accepted Answer · 2022-09-27T21:42:07.417

1

Just match everything that is not a ":

with data(s) as (
    select $${"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}$$
)

select regexp_substr_all(s, 'name":"([^"]*)"', 1, 1, '', 1)
from data

-- [   "simpleword",   "f23ÜÜ" ]

Also an alternative (as discussed in the comments, with the 'e' regex_parameters:

with data(s) as (
    select $${"id":"1","name":"simpleword","type":"test"},{"id":"123","name":"f23ÜÜ","type":"prod"}$$
)

select regexp_substr_all(s, 'name":"([^"]*)"', 1, 1, 'e')
from data

edited Sep 27 '22 at 21:42

answered Sep 26 '22 at 14:56

Felipe Hoffa

54,922
16
151
325

Could you also please explain the regex? – x89 Sep 26 '22 at 20:20
what's the 1, 1, '', 1 part doing? – x89 Sep 27 '22 at 07:25
From the docs (https://docs.snowflake.com/en/sql-reference/functions/regexp_substr_all.html#syntax): `REGEXP_SUBSTR_ALL( , [ , [ , [ , [ , ] ] ] ] )` -- we need group 1 (between parenthesis), which is the last parameter – Felipe Hoffa Sep 27 '22 at 08:09
i understand the subject and pattern. But how's the position helping here? Isn't it position 1 by default too? And if we write "1" as occurence then how do we get all occurrences of the names? and what '' is doing as the regex parameter? its not in the accepted items in the docs – x89 Sep 27 '22 at 08:23
They are all placeholders (default), so I can get that last 1 in (because that's the 1 I need for `group_num`). Also I took a second look, and I like how `'e'` works in this case. See the updated answer. – Felipe Hoffa Sep 27 '22 at 21:39

regex expression to include german characters

1 Answers1