How to extract words entirely written in uppercase with accents (Diacritics) with a Google Sheet REGEXEXTRACT formula?

Question

Ok,

it looks simple but when the words start or finish with an accents it is the mess. I've looked on Stack Overflow and others and haven't really found a way to solve this problem. I would like, to be able with a Google sheet formula, to extract from a cell, words only built with the ASCII characters that follow: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,À,Á,Â,Ã,Ä,Å,Æ,Ç,È,É,Ê,Ë,Ì,Í,Î,Ï,Ð,Ñ,Ò,Ó,Ô,Õ,Ö,Ø,Ù,Ú,Û,Ü,Ý

For example with "Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ" or "ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ" the result has to be the same "ÀLËMMNIÖ DE SANTORINÕ"

This formula works when no accent all: =REGEXEXTRACT(A2;"\b[A-Z]+(?:\s+[A-Z]+)*\b")

These formula work sometimes when the names are easy.

=REGEXEXTRACT(A2;"\b[A-Ý]+(?:\s+[A-Ý]+)*\b")

=REGEXEXTRACT(A2;"\B[A-Ý]+(?:\S+[A-Ý]+)*\B")

Can anybody help me or give me some hint?

Wiktor Stribiżew · Answer 1 · 2020-12-17T20:41:57.787

0

It seems your expected matches are simply between whitespace or start/end of string. If you add a space before and after the cell value, you may simply extract all the chunks of whitespace-separated uppercase letter words between whitespaces, and the formula will boil down to

=REGEXEXTRACT(" " & A2 & " "; "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s")

See the Google sheets demo:

Regex details:

\s - a whitespace
([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*) - Group 1 (the actual value returned by REGEXEXTRACT): one or more uppercase letters from the specified ranges followed with zero or more repetitions of one or more whitespace and then one or more uppercase letters
\s - a whitespace.

You may use an ARRAYFORMULA, as well:

=ARRAYFORMULA(IFERROR(REGEXEXTRACT(" " & A:A & " ", "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s"),""))

edited Dec 17 '20 at 20:41

answered Dec 16 '20 at 13:09

Wiktor Stribiżew

607,720
39
448
563

1

Hello and thanks for your help. The formula works really well in RE2 Regex Demo with the worst examples I have, but I wonder if Google Sheets is really compatible with RE2. I get a wonderful #REF! , "(?:\P{L}|^)(\p{Lu}+(?:\s+\p{Lu}+)*)(?:\P{L}|$)" is not a valid regular expression... – Antonin Thuillier Dec 16 '20 at 14:58
1

@AntoninThuillier Oh, that is right, "*Google products use RE2 for regular expressions. Google Sheets supports RE2 **except Unicode character class matching***" I replaced the regex with a new one, but I think there is actually another bug here, related to the use of non-capturing groups in the Google RE2 implementation. – Wiktor Stribiżew Dec 16 '20 at 15:22
@AntoninThuillier Sorry, it seems the non-capturing group bug persists. Thus, I suggest using a custom regex extracting function with Google Apps Script using my first regex, `/(?<!\p{L})\p{Lu}+(?:\s+\p{Lu}+)*(?!\p{L})/u` – Wiktor Stribiżew Dec 16 '20 at 15:59
Hello and thanks again it is crazy isn'it ? Looking to "(?:[^A-Za-zÀ-ÖØ-Ýà-öø-ý\n]|^)([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)(?:[^A-Za-zÀ-ÖØ-Ýà-öø-ý]|$)" (what a formula by the way !) it works really well except the fact it also matches with the last uppercase of the next word. Example the match of "CHALOBAH Trevoh" is "CHALOBAH T". I cannot use a script, because I have to share and duplicate this sheet a lot. – Antonin Thuillier Dec 16 '20 at 16:25
@AntoninThuillier The fact is that that letter is not *captured*, and thus must not be returned as REGEXEXTRACT only returns the captured values if capturing group is defined. This is a bug. Thus, you need a workaround, see the top of my post. – Wiktor Stribiżew Dec 16 '20 at 16:35
Ouch ! Do you think I can find my way by reversing a regex to avoid this group ? – Antonin Thuillier Dec 16 '20 at 17:21
@AntoninThuillier RE2 library does not support lookarounds. With the non-capturing group bug, you are left with no tool to actually exclude anything from the match. You might think of a REGEXREPLACE approach though. – Wiktor Stribiżew Dec 16 '20 at 17:28
@AntoninThuillier I see you actually do not need *word* boundaries, you want to use *whitespace* boundaries. It makes the task much simpler, see my updated solution. – Wiktor Stribiżew Dec 17 '20 at 20:31

score 0 · Answer 2 · answered Dec 20 '20 at 18:06

Supposing your sample name were in A2, this should work:

=TRIM(REGEXEXTRACT(A2&" ","([A-ZÀ-Ý ]+)\s"))

By appending a space to the end of the string first, we can then look for the [uppercase letter set or space] in any number up ending with a space. This rules out strings like "Éléonorä" and "Camilliâ" because those uppercase letters are not followed by a space.

Put a different way, the rule here says, "Grab as many uppercase letters or spaces in this set as possible, as long as you still have a space left over at the end." And since we appended a space to the end of the entire string, we can catch such groupings anywhere in the modified string.

player0 · Answer 3 · 2020-12-17T20:18:17.293

-1

use:

=ARRAYFORMULA(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
 IFERROR(REGEXEXTRACT(SPLIT(A1:A, " "), "["&TEXTJOIN("", 1, 
 UNIQUE(QUERY({UPPER(CHAR(ROW(65:1500))), LOWER(CHAR(ROW(65:1500)))}, 
 "select Col2 where Col1<>Col2")))&"]+")),,IFERROR(SPLIT(A1:A, " ")))),,9^9))))

or 10 characters shorter:

=INDEX(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
 IFERROR(REGEXEXTRACT(SPLIT(A:A; " "); "["&JOIN(; 
 UNIQUE(LOWER(QUERY(CHAR(ROUNDUP(SEQUENCE(1500; 2; 65)/2)); 
 "select Col1 where lower(Col1)<>upper(Col2)"))))&"]+"));;
 IFERROR(SPLIT(A:A; " "))));;9^9))))

works with all Europe-based alphabets and captures all diacritics out there. it can differentiate between:

LOWER

and

UPPER

edited Dec 17 '20 at 20:18

answered Dec 17 '20 at 01:29

player0

124,011
12
67
124

1

Thanks a lot Player0 ! It works really fine. I just had to "Europeanise" your incredible 1st formula. Tomorrow I will test you second shorter one. Do you see a way to filter the digits (I use Regexreplace) ? – Antonin Thuillier Dec 17 '20 at 19:04
@AntoninThuillier depends on example but digits shouldnt be a problem either – player0 Dec 17 '20 at 19:12

score -1 · Answer 4 · answered Dec 17 '20 at 18:37

Try this- backslashing the non A-Z characters.

[A-Z\À\Á\Â\Ã\Ä\Å\Æ\Ç\È\É\Ê\Ë\Ì\Í\Î\Ï\Ð\Ñ\Ò\Ó\Ô\Õ\Ö\Ø\Ù\Ú\Û\Ü\Ý]

If that fails you can encode each one of those letters like below: Look up for characters: https://www.w3schools.com/charsets/ref_utf_latin1_supplement.asp

[A-Z\u00C0\u00C1... and so on...]

How to extract words entirely written in uppercase with accents (Diacritics) with a Google Sheet REGEXEXTRACT formula?

4 Answers4

LOWER

UPPER