Questions tagged [character-properties]

Character properties are attributes that the Unicode Standard assigns to each encoded character. Processes and algorithms such as case mapping, normalization, line breaking, and bidirectional layout read these properties to implement the character's expected behavior.

The Unicode Standard, on top of defining the encoding of characters, also associates a rich set of semantics with each encoded character: properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files that list the Unicode code points together with their character names and property values.

More information can be found on Wikipedia, in the official Unicode Standard, and in this Unicode Technical Report.
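
As a rough illustration of what "properties" means in practice, the sketch below reads a few UCD properties for a single character using Python's standard `unicodedata` module. The helper name `describe` and the choice of properties are assumptions made for this example; the module exposes only a subset of the UCD, so this is not a complete property enumeration.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a handful of UCD properties for one character.

    `describe` is a hypothetical helper written for this example;
    only the `unicodedata` calls come from the standard library.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Fall back to a placeholder for characters without a name.
        "name": unicodedata.name(ch, "<unnamed>"),
        # Two-letter General_Category value, e.g. "Lu" or "Mn".
        "general_category": unicodedata.category(ch),
        # Canonical_Combining_Class; 0 for non-combining characters.
        "combining_class": unicodedata.combining(ch),
        # Bidi_Class value, e.g. "L" or "NSM".
        "bidi_class": unicodedata.bidirectional(ch),
    }


if __name__ == "__main__":
    # A letter and a combining diacritic, for comparison.
    for ch in ("A", "\u0301"):
        print(describe(ch))
```

Here the combining acute accent (U+0301) is reported as a nonspacing mark (`Mn`) rather than a letter, which is the kind of distinction several of the questions below revolve around.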

92 questions
4
votes
2 answers

What are the `unicode groups` and `block ranges` that can be specified in `\p{name}`?

What are the Unicode groups and block ranges that can be specified in the character class \p{name}, e.g. \p{IsGreek}? Where is the list of names and descriptions available?
ThinkingMonkey
4
votes
1 answer

Properties of combining diacritics

Are combining diacritics counted as letters? As far as I know, they can only combine with other letters in well-formed Unicode. The ICU function to determine if a Unicode codepoint is a letter only takes one codepoint, so for any…
Puppy
4
votes
2 answers

Enumerate a character's Unicode properties in Ruby?

Is there any way to enumerate all of a character's Unicode properties in Ruby? I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (e.g., some_char =~ /\p{P}/ to test whether some_char is punctuation,…
Steven Bedrick
4
votes
1 answer

@Pattern with Unicode script \\p{L}* doesn't work

I have a problem with javax.validation.constraints.Pattern @Pattern validation. @Pattern(regexp = "\\p{L}*", message = "Msg") private String name; When I try to input any text, it doesn't work. When I used: @Pattern(regexp = "[a-zA-Z]*",…
4
votes
3 answers

Mathematica regular expressions on unicode strings

This was a fascinating debugging experience. Can you spot the difference between the following two lines? StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"] StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"] They do very different…
dreeves
4
votes
5 answers

Match unicode in ply's regexes

I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain unicode characters. Therefore the old way to do things is not enough: t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*" In my markup language parser I match…
Cheery
4
votes
2 answers

Searching unicode text using regex

Searching a file written in Hindi (Devanagari, UTF-16) gave rise to the following problem. The file contains: त्रास ततत जुग नींद ना हा बु Note that the first character 'त्र' is made up of multiple code points: त + ् + र Now while searching for 'त'…
user162703
3
votes
4 answers

how to use unicode character groups in javascript's regexs?

Is there a way to use patterns like "\p{L}" in JavaScript natively? (I suppose that is a Perl-compatible syntax.) I'm interested firstly in Firefox support, and possibly WebKit.
user652649
3
votes
3 answers

How can I find out how is a punctuation character form in UTF 8?

I have a set of characters like ., !, ?, ;, (space) and a string, which may or may not be UTF-8 (any language). Is there an easy way to find out if the string contains one of the characters in the set above? For example: 这是一个在中国的字符串。 which translates to This is…
Alex
3
votes
5 answers

Validating a Unicode Name

In ASCII, validating a name isn't too difficult: just make sure all the characters are alphabetical. But what about in Unicode (UTF-8)? How can I make sure there are no commas or underscores (outside of ASCII scope) in a given string? (ideally in…
Gilbert
3
votes
2 answers

Matching a Unicode "name" with a JavaScript Regular Expression

In JavaScript we can match individual Unicode codepoints or codepoint ranges by using the Unicode escape sequences, e.g.: "A".match(/\u0041/) // => ["A"] "B".match(/[\u0041-\u007A]/) // => ["B"] But how could we create a regular expression to match…
maerics
3
votes
1 answer

Unicode regexp to match line-breaks?

I have this form from where I want to submit data to a database. The data is UTF8. I am having trouble with matching line breaks. The pattern I am using is something like this: ~^[\p{L}\p{M}\p{N} ]+$~u This pattern works fine until the user puts a…
Booya
3
votes
2 answers

Regex - Unicode Properties Reference and Examples

I feel lost with the Regex Unicode Properties presented by RegexBuddy. I cannot distinguish between any of the Number properties, and the Math symbol property only seems to match + but not -, *, /, ^ for instance. Is there any documentation /…
Alix Axel
3
votes
2 answers

Latin char in Javascript regexp

How can I include Latin characters like ČčĆćŠšĐđ in this JavaScript regexp: var regex = new RegExp('\\b' + this.value, "i"); UPDATE: I have this code for filtering checkbox labels, but it doesn't work well when there is an input with Č č…
user2406735
3
votes
5 answers

Incrementing a character in Java explanation

I have a Java fragment that looks like this: char ch = 'A'; System.out.println("ch = " + ch); which prints: A then when I do this ch++; // increment ch System.out.println("ch =" + ch); it now prints: B I also tried it with Z and…
Þaw