
perluniprops lists the Unicode properties of the version of Unicode it supports. For Perl 5.32.1, that's Unicode 13.0.0.

You can obtain a list of the characters that match a category using Unicode::Tussle's unichars.

$ unichars '\p{Close_Punctuation}'

And the help:

$ unichars --help
Usage:
    unichars [*options*] *criterion* ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

    OPTIONS:

     Selection Options:

        --bmp           include the Basic Multilingual Plane (plane 0) [DEFAULT]
        --smp           include the Supplementary Multilingual Plane (plane 1)
        --astral    -a  include planes above the BMP (planes 1-15)
        --unnamed   -u  include various unnamed characters (see DESCRIPTION)
        --locale    -l  specify the locale used for UCA functions

     Display Options:

        --category  -c  include the general category (GC=)
        --script    -s  include the script name (SC=)
        --block     -b  include the block name (BLK=)
        --bidi      -B  include the bidi class (BC=)
        --combining -C  include the canonical combining class (CCC=)
        --numeric   -n  include the numeric value (NV=)
        --casefold  -f  include the casefold status
        --decimal   -d  include the decimal representation of the code point

     Miscellaneous Options:

        --version   -v  print version information and exit
        --help      -h  this message
        --man       -m  full manpage
        --debug     -d  show debugging of criteria and examined code point span

     Special Functions:

         $_    is the current code point
         ord   is the current code point's ordinal

         NAME is charname::viacode(ord)
         NUM is Unicode::UCD::num(ord), not code point number
         CF is casefold->{status}
         NFD, NFC, NFKD, NFKC, FCD, FCC  (normalization)
         UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)

         Singleton, Exclusion, NonStDecomp, Comp_Ex
         checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
         NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE

Other than reading the list of categories from the webpage, is there a way to programmatically get all the possible \p{...} categories?

ikegami
alvas
  • No simple way, I imagine. Check what `uniprops` does. – ikegami Apr 17 '21 at 19:16
  • It literally [parses](https://metacpan.org/release/Unicode-Tussle/source/script/uniprops#L618) `perluniprops.pod` – ikegami Apr 17 '21 at 19:20
  • Note that the Unicode database is freely available, so you could build the list of properties yourself if you know the version. [13.0.0](https://www.unicode.org/Public/13.0.0/ucd/) – ikegami Apr 17 '21 at 19:49
  • However, core module [Unicode::UCD](https://metacpan.org/pod/Unicode::UCD) should be able to provide most if not all of the same info for the version of Unicode used by the current Perl. – ikegami Apr 17 '21 at 19:52
  • **What are you trying to accomplish?** – ikegami Apr 17 '21 at 19:57
  • I'm trying to get the full list of categories on perluniprops. – alvas Apr 19 '21 at 00:17
  • Argh... That `perluniprops.pod` parsing looks nasty =( – alvas Apr 19 '21 at 00:18
  • uniprops doesn't list categories. What are you trying to accomplish? – ikegami Apr 19 '21 at 02:01
  • I'm porting code from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl to https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py so I wanted to make sure my implementation is as close as possible to the Perl one when it uses the uniprops categories. – alvas Apr 19 '21 at 02:12
  • Does that program actually accept `\p{}` expressions as input? Maybe I missed it, but it doesn't seem to be the case, so why would knowing the list of Unicode properties help you? What you need to know is which characters each of the properties you already know matches. Except you don't really. Use the [*regex*](https://pypi.org/project/regex/) module instead of the *re* module, and you'll be using the real Unicode properties too. The newest version even uses Unicode 13.0.0, just like the latest Perl. – ikegami Apr 19 '21 at 06:37
  • (I do appreciate that you took a stab at figuring out what the program does rather than just saying "I'm trying to translate this.") – ikegami Apr 19 '21 at 19:36

1 Answer


From the comments, I believe you are trying to port a Perl program that uses \p regex properties to Python. You don't need a list of all categories (whatever that means); you just need to know which code points each of the properties used by the program matches.

Now, you could get the list of code points from the Unicode database. But a much simpler solution is to use Python's regex module instead of the re module. This will give you access to the same Unicode-defined properties that Perl exposes.

The latest version of the regex module even uses Unicode 13.0.0 just like the latest Perl.
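If you do want the code points themselves, here is a minimal sketch of the Unicode-database route using only Python's standard-library unicodedata module. The helper name `chars_in_category` and its `planes` parameter are made up for illustration. It handles only the General_Category value (here `Pe`, the short name of Close_Punctuation), and it reflects whatever Unicode version your Python build ships with, which may differ from Perl's.

```python
import unicodedata


def chars_in_category(category, planes=range(1)):
    """Return every character whose General_Category equals `category`.

    `planes` defaults to plane 0 (the BMP) only; pass range(17) to scan
    every plane. This helper is hypothetical, not part of any library.
    """
    found = []
    for plane in planes:
        start = plane * 0x10000
        for cp in range(start, start + 0x10000):
            ch = chr(cp)
            if unicodedata.category(ch) == category:
                found.append(ch)
    return found


# Pe is the short General_Category name for Close_Punctuation.
close_punct = chars_in_category("Pe")

# The Unicode version depends on the Python build, so print it alongside.
print("Unicode", unicodedata.unidata_version, "-", len(close_punct), "characters")
for ch in close_punct[:5]:
    # Print code point and name only, to avoid terminal encoding issues.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Comparing `unicodedata.unidata_version` against the Unicode version of the Perl in use is a quick way to tell whether differences in the output come from the data or from the port. With the regex module, the same set should be matched by the pattern `\p{Pe}`.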


Note that the program uses \p{IsAlnum}, a long way of writing \p{Alnum}. \p{Alnum} is not a standard Unicode property, but a Perl extension. It's the union of Unicode properties \p{Alpha} and \p{Nd}. I don't know if the regex module defines Alnum identically, but it probably does.

ikegami