'\u00BD' # ½
'\u00B2' # ²
I am trying to understand isdecimal() and isdigit() better; for this it's necessary to understand Unicode numeric value properties. How would I see the numeric value property of, for example, the above two characters?
To get the 'numeric value' contained in the character, you could use the unicodedata.numeric() function:
>>> import unicodedata
>>> unicodedata.numeric('\u00BD')
0.5
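As a small sketch of how the same call behaves on other input: unicodedata.numeric() also accepts an optional default, which is returned instead of raising ValueError when the character has no numeric value. The names `two` and `missing` below are only illustrative.

```python
import unicodedata

# Superscript two carries a numeric value as well.
two = unicodedata.numeric('\u00B2')
print(two)  # 2.0

# A letter has no numeric value; with a default, that default is returned.
missing = unicodedata.numeric('a', None)
print(missing)  # None

# Without a default, the same lookup raises ValueError.
try:
    unicodedata.numeric('a')
except ValueError as exc:
    print("ValueError:", exc)
```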
Use the ord() function to get the integer codepoint, optionally in combination with format() to produce a hexadecimal value:
>>> ord('\u00BD')
189
>>> format(ord('\u00BD'), '04x')
'00bd'
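If it helps, the codepoint can be combined with unicodedata.name() to print the conventional U+XXXX label next to the character name. This is just a sketch; the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return a 'U+XXXX NAME' label for a single character."""
    return 'U+{:04X} {}'.format(ord(ch), unicodedata.name(ch))


for ch in '\u00BD\u00B2':
    print(describe(ch))
# U+00BD VULGAR FRACTION ONE HALF
# U+00B2 SUPERSCRIPT TWO
```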
You can get the character's general category with unicodedata.category(), which you'd then need to check against the documented categories:
>>> unicodedata.category('\u00BD')
'No'
where 'No' stands for Number, Other.
However, there is a series of characters in the category Lo for which .isnumeric() is True; the Python unicodedata database only gives you access to the general category and relies on the str.isdigit(), str.isnumeric(), unicodedata.digit(), unicodedata.numeric(), etc. methods to handle the additional categories.
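One concrete case, as an illustration: the CJK ideograph 三 (U+4E09, "three") is a letter by general category, yet it carries a numeric value. The variable name `three` is only for the example.

```python
import unicodedata

three = '\u4E09'  # 三, CJK ideograph for "three"

# The general category says "Letter, other", not a number category.
print(unicodedata.category(three))  # Lo

# The string methods and numeric() still report it as numeric.
print(three.isnumeric(), three.isdigit(), three.isdecimal())  # True False False
print(unicodedata.numeric(three))  # 3.0
```

So checking category() for 'Nd', 'Nl' or 'No' alone would miss characters like this one.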
If you want a precise list of all numeric Unicode characters, the canonical source is the Unicode database: a series of text files that define the whole of the standard. The DerivedNumericTypes.txt file (v. 6.3.0) gives you a 'view' on that database specific to the numeric properties; it tells you at the top how the file is derived from other data files in the standard. Ditto for the DerivedNumericValues.txt file, listing the exact numeric value per codepoint.
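Without downloading those files, a rough approximation is to scan every codepoint with the data bundled in your Python build. This is only a sketch: the result reflects the Unicode version reported by unicodedata.unidata_version, not necessarily the latest standard, and the names `numeric_characters`, `table` and `by_category` are made up for the example.

```python
import sys
import unicodedata
from collections import Counter


def numeric_characters():
    """Yield (character, value) for every codepoint that has a numeric value."""
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        # The default (None) is returned when there is no numeric value.
        value = unicodedata.numeric(ch, None)
        if value is not None:
            yield ch, value


table = dict(numeric_characters())
by_category = Counter(unicodedata.category(ch) for ch in table)

print("Unicode data version:", unicodedata.unidata_version)
print("characters with a numeric value:", len(table))
print("by general category:", by_category.most_common())
```

The per-category counts make the earlier point visible: alongside Nd, Nl and No there are entries in Lo.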
The docs explicitly specify the relation between the methods and the Numeric_Type property.
def is_decimal(c):
    """Whether input character is Numeric_Type=decimal."""
    return c.isdecimal()  # it means General Category=Decimal Number in Python

def is_digit(c):
    """Whether input character is Numeric_Type=digit."""
    return c.isdigit() and not c.isdecimal()

def is_numeric(c):
    """Whether input character is Numeric_Type=numeric."""
    return c.isnumeric() and not c.isdigit() and not c.isdecimal()
Example:
>>> for c in '\u00BD\u00B2':
...     print("{}: Numeric: {}, Digit: {}, Decimal: {}".format(
...         c, is_numeric(c), is_digit(c), is_decimal(c)))
...
½: Numeric: True, Digit: False, Decimal: False
²: Numeric: False, Digit: True, Decimal: False
I'm not sure Decimal Number and Numeric_Type=Decimal will always be identical.
Note: '\u00B2' is not decimal because superscripts are explicitly excluded by the standard, see 4.6 Numerical Value (Unicode 6.2).
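The same checks can be folded into a single function that returns one label per character. This is a sketch that follows the relation described in the Python docs, with the same caveat about Decimal Number versus Numeric_Type=Decimal; the name `numeric_type` is hypothetical.

```python
def numeric_type(ch):
    """Return an approximate Numeric_Type label for a single character."""
    # Order matters: every decimal is also a digit, and every digit is numeric.
    if ch.isdecimal():
        return 'Decimal'
    if ch.isdigit():
        return 'Digit'
    if ch.isnumeric():
        return 'Numeric'
    return 'None'


for ch in '5\u00B2\u00BDa':
    print(ch, numeric_type(ch))
# 5 Decimal
# ² Digit
# ½ Numeric
# a None
```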