4

Which Unicode groups (scripts, categories) and block ranges can be specified in the character class \p{name}?

e.g.

\p{IsGreek}

Where can I find the list of names and their descriptions?

tchrist
ThinkingMonkey

2 Answers

5

Regular-Expressions.info has lists.

You can also consult PCRE's own man pages, which list the supported names in the "Unicode character properties" section:

Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example:

\p{Greek}
\P{Han}

Those that are not part of an identified script are lumped together as "Common". The current list of scripts is:

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.
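Because the accepted names differ between engines and versions, one way to check a particular engine is simply to try compiling `\p{Name}` for each candidate. Below is a minimal Python sketch of that idea. The helper `probe_names` is a hypothetical name, not part of any library. Note that Python's standard `re` module does not support `\p{...}` at all, so with the default argument every name is reported as unsupported; the sketch assumes you pass in the compile function of an engine that does support it.

```python
import re
from typing import Callable, Dict, Iterable


def probe_names(
    names: Iterable[str],
    compile_fn: Callable[[str], object] = re.compile,
) -> Dict[str, bool]:
    r"""Report, for each name, whether \p{name} compiles with compile_fn.

    probe_names is a hypothetical helper for illustration only.
    """
    results = {}
    for name in names:
        pattern = r"\p{" + name + "}"
        try:
            compile_fn(pattern)
            results[name] = True
        except Exception:
            # Engines raise different error types for unknown properties,
            # so any failure to compile is treated as "not supported".
            results[name] = False
    return results


if __name__ == "__main__":
    # The standard re module rejects \p entirely, so both come out False.
    print(probe_names(["Greek", "Adlam"]))
```

If the third-party `regex` package happens to be installed (an assumption, it is not in the standard library), passing `regex.compile` as `compile_fn` should show which of the candidate names that engine accepts.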

Joey
  • I actually found the same link just now :). Thanks though. – ThinkingMonkey Jan 25 '12 at 12:34
  • Thanks for pointing out the `man page`; I hadn't spotted the `Unicode character properties` section. – ThinkingMonkey Jan 25 '12 at 12:38
  • Unfortunately that list is wrong. `Adlam` to `Zanabazar_Square` is missing. – Anon Apr 04 '19 at 09:41
  • @Akiva: Perhaps you should take that up with PCRE lagging behind with Unicode support (or not updating their documentation). – Joey Apr 04 '19 at 09:46
  • @Joey I've been searching for a discovery method that is not prone to human error. Even something such as https://github.com/google/re2/wiki/Syntax gave me some false positives with `QRegularExpression`, not to mention `regex101`. I would imagine the ideal way would be to dump the source files of the implementations concerned, because the supported names vary greatly between languages and engines. – Anon Apr 04 '19 at 11:12
2

Here is a list of the Unicode character properties (general categories) that you can specify inside the braces: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Categories

You can also match Unicode blocks or scripts. Information about those is available at http://www.regular-expressions.info/unicode.html#block and http://www.regular-expressions.info/unicode.html#script.
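To see which general category a given character belongs to, independent of any regex engine, Python's standard `unicodedata` module reports the two-letter category codes (`Lu`, `Ll`, `Nd`, ...) that names like `\p{Lu}` refer to. The following is only an illustrative sketch; `filter_by_category` is a hypothetical helper name, and it covers general categories only, not scripts or blocks.

```python
import unicodedata
from typing import List


def filter_by_category(text: str, prefix: str) -> List[str]:
    """Return the characters of text whose general category starts with prefix.

    A one-letter prefix such as "L" selects the whole group (Lu, Ll, Lt, ...),
    while a two-letter code such as "Nd" selects one specific category.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]


if __name__ == "__main__":
    sample = "A\u03b15 "  # Latin capital A, Greek small alpha, digit 5, space
    for ch in sample:
        print(repr(ch), unicodedata.category(ch))
    print(filter_by_category(sample, "L"))
```

This can be handy for sanity-checking what a category-based pattern ought to match before testing it in the regex engine you actually use.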

entropid