Questions tagged [character-properties]

Character properties are a set of attributes that the Unicode Standard defines for each character it encodes. Many of these properties are specified in relation to the processes or algorithms that interpret them in order to implement correct character behavior.

In addition to defining the encoding of characters, the Unicode Standard associates a rich set of semantics with each encoded character: properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files that list the Unicode code points, the character names, and the property values assigned to them.
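
As a rough illustration of what looking up a character's properties can look like in practice, here is a minimal sketch that uses Python's standard `unicodedata` module, which exposes a subset of the UCD. The helper name `describe` and the keys of the dictionary it returns are made up for this example and are not part of any standard API. Note that `is_letter` relies on `str.isalpha()`, which is based on the General_Category letter classes and is not necessarily identical to the Unicode `Alphabetic` property.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a few Unicode properties of a single character.

    `describe` is a hypothetical helper written for this example.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Fall back to a placeholder for code points without a name.
        "name": unicodedata.name(ch, "<unnamed>"),
        # Two-letter General_Category value, e.g. "Ll", "Lu", "Nd", "Pd".
        "general_category": unicodedata.category(ch),
        # East_Asian_Width value, e.g. "F" for fullwidth forms.
        "east_asian_width": unicodedata.east_asian_width(ch),
        # True when the General_Category is one of the letter classes.
        "is_letter": ch.isalpha(),
    }


# ASCII letter, accented capital, fullwidth letter, digit, em dash.
for ch in ["a", "\u00c5", "\uff41", "5", "\u2014"]:
    print(describe(ch))
```

The same lookups are available in most regular expression engines that support `\p{...}` property escapes, although the set of supported properties varies between languages.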

More information can be found on Wikipedia, in the official Unicode Standard as well as in this Unicode Technical Report.

92 questions
8
votes
2 answers

Perl: How to match FULLWIDTH LATIN SMALL

I am using listadmin to manage many mailman-based mailing lists. I have a long list of subjects and from addresses set up to block spam. Recently, I received smarter spam in the sense that it uses nice-looking Unicode characters, e.g.: Subject: Al l…
Frederick Nord
7
votes
2 answers

Iterating through Unicode codepoints character by character

I've got a series of Unicode codepoints. What I really need to do is iterate through these codepoints as a series of characters, not a series of codepoints, and determine properties of each individual character, e.g. is a letter, whatever. For…
Puppy
7
votes
2 answers

Regular expression to match ASCII and Unicode letters

Recently I discovered, to my surprise, that JavaScript has no built-in support for Unicode regular expressions. So how can I test a string for letters only, Unicode or ASCII?
Thomas
7
votes
1 answer

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've…
Alterscape
7
votes
1 answer

How to specify Regexp for unicode cyrillic characters in Ruby 1.9

#coding: utf-8 str2 = "asdfМикимаус" p str2.encoding # p str2.scan /\p{Cyrillic}/ #found all cyrillic characters str2.gsub!(/\w/u,'') #removes only latin characters puts str2 The question is: why does \w ignore Cyrillic characters? I…
user326922
7
votes
4 answers

How do I get a list of all Unicode characters that have a given property?

Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD,…
Chas. Owens
6
votes
1 answer

Matching case sensitive unicode strings with regular expressions in Python

Suppose I want to match a lowercase letter followed by an uppercase letter. I could do something like re.compile(r"[a-z][A-Z]") Now I want to do the same thing for unicode strings, i.e. match something like 'aÅ' or 'yÜ'. Tried…
repoman
6
votes
2 answers

How to check which language supports which Support Level in Unicode Regular Expressions?

The various levels of Unicode Regular Expression support are described in UTS#18. Is there a way to have a few tests for every requirement, so it is possible to port the tests to the language in question, run them and gather the results? Do other…
soc
6
votes
3 answers

Match C# Unicode Identifier using Regex

What is the right way to match a C# identifier, specifically a property or field name, using .NET Regex patterns? Background: I used to use the ASCII-centric @"[_a-zA-Z][_a-zA-Z0-9]*" But now Unicode uppercase and lowercase characters are legit,…
Max Yaffe
5
votes
4 answers

Split String using Unicode delimiter

I need to split a string with "-" as delimiter in Java. Ex: "Single Room - Enjoy your stay" I have the same data coming in English and German depending on locale. Hence I cannot use the usual string.split("-"). The unicode for "-" character is…
Bhavya
5
votes
2 answers

How to properly write regex for unicode first name in Java?

I need to write a regular expression so I can replace the invalid characters in a user's input before sending it further. I think I need to use string.replaceAll("regex", "replacement") to do that. The particular line of code should replace all…
Rihards
5
votes
4 answers

List of Unicode alphabetic characters

I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic. However, I cannot find them in the Unicode Character Database no matter how I search for them.…
thSoft
5
votes
2 answers

Java regex for any symbol?

Is there a regex which accepts any symbol? EDIT: To clarify what I'm looking for: I want to build a regex which will accept ANY number of whitespace characters, and then it must contain at least 1 symbol (e.g. , . " ' $ £ etc.) or (not exclusive or) at least 1…
Skizit
5
votes
3 answers

How to mark all CJK text in a document?

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to language, except for English, and output a new file, e.g., here is a sample…
Village
4
votes
3 answers

Regular expression to allow all alphabet characters plus unicode characters

I need a regular expression to allow all alphabet characters plus the Greek/German alphabet in a string, but replace those symbols ?,&,^,". with * I skipped the list of characters to escape to make the question simple. I really want to see how to…
Panos Kalatzantonakis