How do I match only fully-composed characters in a Unicode string in Perl?

Question

I'm looking for a way to match only fully composed characters in a Unicode string.

Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match the Japanese character 'あ', since it is not a control character, or is [:print:] always going to be ASCII codes 0x20 to 0x7E?

Is there any character class, including Perl REs, that can be used to match anything other than a control character? If [:print:] includes only characters in ASCII range I would assume [:cntrl:] does too.

score 6 · Accepted Answer · answered Oct 15 '08 at 05:27

echo あ| perl -nle 'BEGIN{binmode STDIN,":utf8"} print"[$_]"; print /[[:print:]]/ ? "YES" : "NO"'

This mostly works, though it generates a warning about a wide character. But it gives you the idea: you must be sure you're dealing with a real Unicode string (check utf8::is_utf8). Or just read perlunicode; the whole subject still makes my head spin.
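To illustrate the point about needing a real character string, here is a minimal sketch that assumes the input arrives as UTF-8 encoded bytes. The helper name is_single_printable is made up for this example; Encode::decode and utf8::is_utf8 are the standard functions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the string is exactly one printable character.
sub is_single_printable {
    my ($string) = @_;
    return $string =~ /\A[[:print:]]\z/ ? 1 : 0;
}

# The UTF-8 encoding of U+3042 (HIRAGANA LETTER A), as raw bytes.
my $bytes = "\xE3\x81\x82";

# Decoding turns the three bytes into a single character.
my $chars = decode('UTF-8', $bytes);

for my $pair ([bytes => $bytes], [chars => $chars]) {
    my ($label, $string) = @$pair;
    printf "%s: length %d, is_utf8 %s, single printable %s\n",
        $label,
        length($string),
        (utf8::is_utf8($string)      ? 'yes' : 'no'),
        (is_single_printable($string) ? 'yes' : 'no');
}
```

The undecoded value is three separate bytes as far as the regex engine is concerned, so an anchored single-character match cannot succeed on it. Only the decoded string is seen as one character.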

Tanktalus

You can get rid of the ugly BEGIN{binmode STDIN, ":utf8"} kludge by supplying the option -CS on the command line. – moritz Oct 15 '08 at 06:43
... that will also make the warning go away, because it sets up STDOUT in the same way as STDIN. – moritz Oct 15 '08 at 06:50
That may not be as much of an option if the OP is writing a module to handle this instead of a standalone script. So I'm going to leave my solution, as well as your fix in the hopes the OP can figure out which one is better for his/her scenario. Thanks :-) – Tanktalus Oct 15 '08 at 13:35
This pattern is wrong. [[:print:]] will match "\x{3099}" which is not a fully-composed character! See my answer for a working pattern. – daxim Jan 07 '10 at 22:59

score 5 · Answer 2 · edited Nov 18 '10 at 00:29

I think you don't want or need locales for that, but rather Unicode. If you have decoded a text string, \w will match word characters in any language, \d matches not just 0..9 but every Unicode digit, and so on. In regexes you can query Unicode properties with \p{PropertyName}. Particularly interesting for you might be \p{Print}. Here's a list of all the available Unicode character properties.

I wrote an article about the basics and subtleties of Unicode and Perl. It should give you a good idea of what to do so that perl recognizes your string as a sequence of characters, not just a sequence of bytes.

Update: with Unicode you don't get language-dependent behaviour, but instead sane defaults regardless of language. This may or may not be what you want, but for the distinction between printable and control characters I don't see why you'd need language-dependent behaviour.
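As a rough illustration of the property-based approach, the following sketch classifies a few characters with \p{Cntrl} and \p{Print} and checks \d against a non-ASCII digit. The helper name classify and the sample table are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a single character by Unicode property.
sub classify {
    my ($char) = @_;
    return 'control'   if $char =~ /\A\p{Cntrl}\z/;
    return 'printable' if $char =~ /\A\p{Print}\z/;
    return 'other';
}

# Characters are written as code point escapes, so they are character
# strings regardless of the encoding of this source file.
my %samples = (
    'U+0007 BELL'                     => "\x{07}",
    'U+0663 ARABIC-INDIC DIGIT THREE' => "\x{0663}",
    'U+3042 HIRAGANA LETTER A'        => "\x{3042}",
);

for my $name (sort keys %samples) {
    my $char = $samples{$name};
    printf "%s: %s%s\n", $name, classify($char),
        ($char =~ /\A\d\z/ ? ', digit' : '');
}
```

No locale is consulted anywhere here; the results depend only on the Unicode properties of each code point.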

score 4 · Answer 3 · answered Jan 07 '10 at 23:12

\X matches a fully-composed character (sequence). Proof:

#!/usr/bin/env perl
use 5.010;
use utf8;
use Encode qw(encode_utf8);

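# Test data: plain character, precomposed ご (U+3054), decomposed ご
# (U+3053 + U+3099), and a lone combining mark. The escapes keep the
# precomposed and decomposed forms distinct even if this file is normalized.
for my $string ('あ', "\x{3054}", "\x{3053}\x{3099}", "\x{3099}") {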
    say encode_utf8 sprintf "%s $string", $string =~ /\A \X \z/msx ? 'ok' : 'nok';
}

The test data are: a normal character, a precomposed character, a combining character sequence, and a lone combining character (which "doesn't count" on its own; this is a simplification of Chapter 3 of the Unicode Standard).

Substitute \X with [[:print:]] to see that Tanktalus' answer produces false matches for the last two cases.
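For working with whole strings rather than single test values, a minimal sketch along the same lines might look like this. The helper names clusters and is_base_plus_marks are made up for this example. Because the exact definition of \X has changed between Perl releases (since 5.12 it follows Unicode's extended grapheme clusters), the second helper spells out "one non-mark character followed by combining marks" explicitly with \PM\pM* instead of relying on \X.

```perl
use strict;
use warnings;

# Hypothetical helper: split a decoded string into \X clusters.
sub clusters {
    my ($string) = @_;
    return $string =~ /(\X)/g;
}

# Hypothetical helper: true if the string is one non-mark character
# followed by zero or more combining marks (general category M).
sub is_base_plus_marks {
    my ($string) = @_;
    return $string =~ /\A \PM \pM* \z/x ? 1 : 0;
}

# U+3053 + U+3099 (a decomposed hiragana GO), then U+3042 (hiragana A).
my $text  = "\x{3053}\x{3099}\x{3042}";
my @parts = clusters($text);

printf "%d code points, %d clusters\n", length($text), scalar @parts;
printf "lone U+3099 is base plus marks: %s\n",
    (is_base_plus_marks("\x{3099}") ? 'yes' : 'no');
```

The explicit pattern is a simplification as well: it only knows about combining marks, not about the other cluster types that newer definitions of \X cover.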

score 2 · Answer 4 · answered Oct 15 '08 at 03:11

Yes, those expressions are locale dependent.

Jonathan Leffler

Can you name an environment and/or regular expression implementation that allows [:print:] to respect a Japanese UTF-8 locale/encoding? I am using Perl in Linux with Japanese UTF-8 locale/encoding and it does not match Japanese character. – dreamlax Oct 15 '08 at 03:14

score 1 · Answer 5 · answered Oct 15 '08 at 03:26

You could always use the character class [^[:cntrl:]] to match non-control characters.
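A minimal sketch of that idea, assuming the input is already a decoded character string. Both helper names are made up for this example. The second helper is an optional stricter variant: directional marks such as U+200E have general category Cf (format) rather than Cc (control), so a class built only on [:cntrl:] lets them through.

```perl
use strict;
use warnings;

# Hypothetical helper: true if no character is in [[:cntrl:]].
sub has_no_cntrl {
    my ($string) = @_;
    return $string =~ /\A[^[:cntrl:]]*\z/ ? 1 : 0;
}

# Hypothetical helper: additionally rejects format characters
# (general category Cf), e.g. U+200E LEFT-TO-RIGHT MARK.
sub has_no_cntrl_or_format {
    my ($string) = @_;
    return $string =~ /\A[^\p{Cc}\p{Cf}]*\z/ ? 1 : 0;
}

my %samples = (
    'plain text'      => "abc",
    'embedded tab'    => "a\tb",
    'embedded U+200E' => "a\x{200E}b",
);

for my $name (sort keys %samples) {
    my $string = $samples{$name};
    printf "%s: no cntrl %s, no cntrl or format %s\n", $name,
        (has_no_cntrl($string)           ? 'yes' : 'no'),
        (has_no_cntrl_or_format($string) ? 'yes' : 'no');
}
```

Whether format characters should be rejected depends on the application; the stricter variant is only one possible choice.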

Adam Rosenfield

This does not match Unicode control characters (in my environment setup and using Perl). There are Unicode control characters for changing text direction and so on. Using [^[:cntrl:]] will match these Unicode ones but not ASCII ones. – dreamlax Oct 15 '08 at 04:03
