What are the `unicode groups` and `block ranges` that can be specified in `\p{name}`?

Question

What are the Unicode groups and block ranges that can be specified in the character class \p{name}?

e.g.

\p{IsGreek}

Where is the list of names and descriptions available?

score 5 · Accepted Answer · answered Jan 25 '12 at 12:32

Regular-Expressions.info has lists.

You can also consult the PCRE man pages themselves, which say:

Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example:
\p{Greek}
\P{Han}
Those that are not part of an identified script are lumped together as "Common". The current list of scripts is:

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.

Thanks for pointing out the `man page`; I hadn't spotted the `Unicode character properties` section. — ThinkingMonkey, Jan 25 '12 at 12:38
Unfortunately that list is wrong. `Adlam` to `Zanabazar_Square` is missing. — Anon, Apr 04 '19 at 09:41
@Akiva: Perhaps you should take that up with PCRE, which is either lagging behind on Unicode support or not updating its documentation. — Joey, Apr 04 '19 at 09:46
@Joey I've been searching for a discovery method that is not prone to human error. Even something such as https://github.com/google/re2/wiki/Syntax gave me some false positives with `QRegularExpression`, not to mention `regex101`. I would imagine the ideal way would be to dump the source files of the various implementations concerned, because support is going to vary greatly between languages and engines. — Anon, Apr 04 '19 at 11:12
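
One way to approach the discovery problem raised in the comments above is to ask the engine itself rather than trust a published list. The following is only a rough sketch under stated assumptions: `probe_property_names` is a hypothetical helper name, and the default `compile_fn` is Python's built-in `re.compile`, which to my knowledge does not accept `\p{...}` at all, so a different compile function would have to be passed in to probe an engine that does.

```python
import re


def probe_property_names(names, compile_fn=re.compile):
    """Report, for each name, whether compile_fn accepts the pattern \\p{name}.

    compile_fn is any callable that takes a pattern string and raises an
    exception when the pattern is invalid (an assumption about the engine).
    """
    results = {}
    for name in names:
        pattern = r"\p{" + name + "}"
        try:
            compile_fn(pattern)
        except Exception:  # each engine raises its own error type
            results[name] = False
        else:
            results[name] = True
    return results


if __name__ == "__main__":
    # Candidate names taken from the list and comments above.
    candidates = ["Greek", "Han", "Adlam", "Zanabazar_Square"]
    for name, accepted in probe_property_names(candidates).items():
        print(name, "accepted" if accepted else "rejected")
```

Passing the `compile` function of another engine's Python binding (for example the third-party `regex` package, if it is installed) should probe that engine instead. Note that a pattern compiling without error only shows that the name is recognised; it does not show that the name matches the characters you expect, which is where the false positives mentioned above can still creep in.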

score 2 · Answer 2 · answered Jan 25 '12 at 12:35

Here you can find a list of the Unicode Character Properties that you can specify in the brackets: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Categories

You can also match Unicode blocks or scripts; information about those is available at http://www.regular-expressions.info/unicode.html#block and http://www.regular-expressions.info/unicode.html#script.
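
To make the category names linked above a little more concrete, here is a small sketch that looks up the Unicode General Category of each character with Python's standard `unicodedata` module. It is an illustration of what a category such as `Lu` or `Nd` selects, not a regex engine: `chars_in_category` is a hypothetical helper, and the prefix matching (so that `L` covers `Lu`, `Ll`, and so on) is an assumption modelled on how `\p{L}` is usually described.

```python
import unicodedata


def chars_in_category(text, category):
    """Return the characters of text whose General Category starts with category.

    A one-letter category such as "L" therefore covers "Lu", "Ll", "Lt", ...
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(category)]


if __name__ == "__main__":
    # "A", Greek small alpha, "1", Arabic-Indic digit three, space, "b"
    sample = "A\u03b11\u0663 b"
    for cat in ("Lu", "L", "Nd"):
        print(cat, chars_in_category(sample, cat))
```

The two-letter codes returned by `unicodedata.category` are the same abbreviations used in the category tables linked above, so the sketch can double as a quick check of which category a given character belongs to.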
