Questions tagged [character-properties]

Character properties are attributes that the Unicode Standard assigns to each encoded character. Processes and algorithms interpret these properties in order to implement the character's behavior.

The Unicode Standard, on top of defining the encoding of characters, also associates a rich set of semantics with each encoded character—properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names.
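As a rough illustration of what looking up such properties can look like in practice, here is a minimal sketch using Python's standard `unicodedata` module, which exposes a subset of the UCD. The helper names `describe` and `is_letter` are hypothetical and chosen only for this example; they are not part of any library.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character (illustrative helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # name() raises ValueError for unnamed characters unless a default is given
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
    }


def is_letter(ch: str) -> bool:
    """True if the General_Category is one of Lu, Ll, Lt, Lm, Lo (the "L" group)."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    for ch in ("A", "é", "あ", "1", "\u0301"):
        print(describe(ch), is_letter(ch))
```

The "L" group checked by `is_letter` is the same grouping that regex engines with Unicode property support refer to as `\p{L}`. Python's built-in `re` module does not support `\p{...}` syntax, so a check like this (or a third-party regex library) is one possible workaround.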

More information can be found on Wikipedia, in the official Unicode Standard as well as in this Unicode Technical Report.

92 questions
15
votes
2 answers

What is the {L} Unicode category?

I came across some regular expressions that contain [^\\p{L}]. I understand that this is using some form of a Unicode category, but when I checked the documentation, I found only the following "L" categories: Lu Uppercase letter …
uTubeFan
  • 6,664
  • 12
  • 41
  • 65
13
votes
7 answers

Regex for names with special characters (Unicode)

Okay, I have read about regex all day now, and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too. I…
Kristoffer la Cour
  • 2,591
  • 3
  • 25
  • 36
12
votes
2 answers

Javascript unicode (greek) regular expressions

I would like to use this regular expression new RegExp("\b"+pat+"\b") in Greek text, but the "\b" metacharacter supports only ASCII characters. I tried the XRegExp library but I didn't manage to solve the issue. Any suggestions would be greatly…
kylito
  • 121
  • 1
  • 4
12
votes
1 answer

Efficiently list all characters in a given Unicode category

Often one wants to list all characters in a given Unicode category. For example: List all Unicode whitespace, How can I get all whitespaces in UTF-8 in Python? Characters with the property Alphabetic It is possible to produce this list by…
Mechanical snail
  • 29,755
  • 14
  • 88
  • 113
10
votes
5 answers

How to validate both Chinese (unicode) and English name?

I have a multilingual website (Chinese and English). I would like to validate a text field (name field) in JavaScript. I have the following code so far. var chkName = /^[characters]{1,20}$/; if( chkName.test("[name value goes here]") ){ …
Moon
  • 22,195
  • 68
  • 188
  • 269
10
votes
2 answers

How to determine if a character is a Chinese character

How to determine if a character is a Chinese character using ruby?
HelloWorld
  • 7,156
  • 6
  • 39
  • 36
10
votes
9 answers

Python: Split unicode string on word boundaries

I need to take a string, and shorten it to 140 characters. Currently I am doing: if len(tweet) > 140: tweet = re.sub(r"\s+", " ", tweet) #normalize space footer = "… " + utils.shorten_urls(post['url']) avail = 140 - len(footer) words…
Paul Tarjan
  • 48,968
  • 59
  • 172
  • 213
10
votes
1 answer

Regular expression to match boundary between different Unicode scripts

Regular expression engines have a concept of "zero width" matches, some of which are useful for finding edges of words: \b - present in most engines to match any boundary between word and non-word characters \< and \> - present in Vim to match only…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
9
votes
3 answers

Latin Characters check

There are some similar questions out there, but none that are quite the same or that have an answer that works for me. I need a JavaScript function which validates whether a text field contains all valid Latin characters, so no Cyrillic or Chinese,…
CompanyDroneFromSector7G
  • 4,291
  • 13
  • 54
  • 97
8
votes
3 answers

Scanning for Unicode Numbers in a string with \d

According to the Oniguruma documentation, the \d character type matches: decimal digit char Unicode: General_Category -- Decimal_Number However, scanning for \d in a string with all the Decimal_Number characters results in only Latin 0-9 digits…
Phrogz
  • 296,393
  • 112
  • 651
  • 745
8
votes
3 answers

POSIX character equivalents in Java regular expressions

I would like to use a regular expression like this in Java : [[=a=][=e=][=i=]]. But Java doesn't support the POSIX classes [=a=], [=e=] etc. How can I do this? More precisely, is there a way to not use US-ASCII?
Stephan
  • 41,764
  • 65
  • 238
  • 329
8
votes
4 answers

regular expression containing unicode words

I'd like to match all strings containing a certain word. like: String regex = (?:\P{L}|\W|^)(ベスパ)(?:\b|$) however, the Pattern class doesn't compile it: java.util.regex.PatternSyntaxException: Unmatched closing ')' near index…
Frost
  • 3,786
  • 5
  • 23
  • 29
8
votes
2 answers

Obtaining unicode characters of a language in Java

Is there any way in Java so that I can obtain all the Unicode characters of a particular language (for example Bengali or Arabic)?
Muhammad Asaduzzaman
  • 1,201
  • 3
  • 19
  • 33
8
votes
1 answer

Replace Unicode Control Characters

I need to replace all special control characters in a string in Java. I want to query the Google Maps API v3, and Google doesn't seem to like these characters. Example:…
Cyril Gandon
  • 16,830
  • 14
  • 78
  • 122
8
votes
5 answers

How do I match only fully-composed characters in a Unicode string in Perl?

I'm looking for a way to match only fully composed characters in a Unicode string. Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match Japanese character 'あ',…
dreamlax
  • 93,976
  • 29
  • 161
  • 209