I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's `reverse` because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?

5 Answers
You can use the `\X` special escape (which matches an extended grapheme cluster: a base character together with any combining characters that follow it) with `split` to make a list of graphemes (with empty strings between them), reverse that list, then `join` the pieces back together:
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
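To see where the empty strings come from, here is a minimal sketch that dumps the list `split /(\X)/` produces next to the list a global match produces (the `m//g` form suggested in a comment on this answer). The variable names are illustrative only, and the `binmode` call is an addition to keep `print` from warning about wide characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character in print" warnings

my $original = "re\x{0301}sume\x{0301}";

# split with a capturing group returns the (empty) fields and the separators.
my @with_split = split /(\X)/, $original;

# A global match in list context returns only the captured graphemes.
my @with_match = $original =~ /(\X)/g;

# Brackets make the empty strings visible.
print "split: ", join(' ', map { "[$_]" } @with_split), "\n";
print "match: ", join(' ', map { "[$_]" } @with_match), "\n";

# Both lists join to the same reversed string.
my $from_split = join '', reverse @with_split;
my $from_match = join '', reverse @with_match;
print "same result\n" if $from_split eq $from_match;
```

The empty strings are harmless here because `join ''` swallows them; they only cost a little extra work on long strings.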

-
For those confused (as I was at first) about why there are empty strings between the graphemes, it's because the `split` is inverted: it uses the data that's wanted as the separator. The empty string is what's "between" two graphemes. It's only by including the separator in the result that you get the graphemes mixed in with the "real" result -- a bunch of empty strings. An alternative (and slightly faster) method that avoids that is to use an `m//g` to capture the graphemes instead: `join '', reverse $original =~ /(\X)/g` – Michael Carman Aug 28 '09 at 16:59
-
To clarify Michael's comment, when you use memory parentheses in a regex you give to split, you trigger "separator retention mode". You get back the thing that goes between the parts you are splitting up. You don't need to do that, however. The pattern (?=\X) does the same thing with no extra bits. Not that the empty string really matters that much for small strings. – brian d foy Aug 28 '09 at 19:04
-
You're right to point out "separator retention mode", thank you, that was helpful. However, (?=\X) is not equivalent. For proof, consider these two examples: split /(a)/, "abc" is not equivalent to split /(?=a)/, "abc" and split /(b+c)/, "abbcd" is not equivalent to split /(?=b+c)/, "abbcd" – Flimm Sep 16 '11 at 16:36
-
Indeed, those are not equivalent, but I wasn't using those. I was only talking about the particular thing I was using. – brian d foy Sep 16 '11 at 17:33
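The two pairs of patterns Flimm mentions can be compared directly. This is a small sketch using only plain ASCII strings; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capturing group: the separator itself is returned among the fields.
my @capture_a   = split /(a)/,     "abc";    # ("", "a", "bc")
# Lookahead: zero-width separator, nothing is consumed or returned.
my @lookahead_a = split /(?=a)/,   "abc";    # ("abc")

my @capture_b   = split /(b+c)/,   "abbcd";  # ("a", "bbc", "d")
my @lookahead_b = split /(?=b+c)/, "abbcd";  # ("a", "b", "bcd")

# Print each list with "|" between fields so empty fields stay visible.
for my $result (\@capture_a, \@lookahead_a, \@capture_b, \@lookahead_b) {
    print join('|', @$result), "\n";
}
```

The difference comes from two things: a zero-width match at the very start of the string never produces a leading empty field, and a lookahead can match again one position later inside text that a consuming pattern would have swallowed whole.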
The best answer is to use Unicode::GCString, as Sinan points out.
I modified Chas's example a bit:
- Set the encoding on STDOUT to avoid "wide character in print" warnings;
- Use a positive lookahead assertion (and no separator retention mode) in split (doesn't work after 5.10, apparently, so I removed it)
It's basically the same thing with a couple of tweaks.
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE

-
Wow. I like perl, but that split expression is pretty magical. My first thought was "brute force": make a function to do what the split does -- return a list of strings, each entry of which represents a logical character. However you get that list (call it @x), the join( '', reverse( @x) ) part obviously follows, fortunately. – Roboprog Aug 28 '09 at 19:20
-
Magical? How so? It's just a regex with no side effects and it only does exactly what you see. If you think that's magic, you haven't seen the real black arts of Perl. You might call it clever (although I wouldn't), but it's not magical. It's probably just something you haven't ever used. – brian d foy Aug 28 '09 at 19:57
-
I tried running this example using Perl v5.12.4 and it didn't work. Using /(\X)/ instead did. Out of interest, did this answer work in previous versions of Perl, or did we just miss the obvious? – Flimm Sep 16 '11 at 16:31
-
It looks like it works under 5.10 but not 5.12 or 5.14. I think that must be a new bug. – brian d foy Sep 16 '11 at 17:31
-
@briandfoy I am too lazy to look right now, did you file a bug about this? – Chas. Owens Jan 25 '12 at 11:53
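The "function that returns a list of logical characters" idea from the comments could be sketched roughly as follows. The names `graphemes` and `reverse_graphemes` are hypothetical helpers, not part of any module, and the sketch uses a global `\X` match rather than `split`, so it does not involve the lookahead form of `split` discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character in print" warnings

# Hypothetical helper: return the graphemes of a string as a list.
# Call it in list context; in scalar context the match returns true/false.
sub graphemes {
    my ($string) = @_;
    return $string =~ /\X/g;
}

# Hypothetical helper: reverse a string grapheme by grapheme.
sub reverse_graphemes {
    my ($string) = @_;
    return join '', reverse graphemes($string);
}

my $original = "re\x{0301}sume\x{0301}";
print "original: [$original]\n";
print "reversed: [", reverse_graphemes($original), "]\n";
```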
You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér

Perl6::Str->reverse also works.
In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.
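A minimal sketch of the normalization approach and of where it breaks down, assuming `q` followed by U+0301 as the example of a pair with no precomposed form (the variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character in print" warnings

# e + U+0301 composes to U+00E9, so a plain reverse is safe after NFC.
my $resume          = NFC("re\x{0301}sume\x{0301}");   # "r\x{e9}sum\x{e9}"
my $reversed_resume = reverse $resume;                  # "\x{e9}mus\x{e9}r"

# q + U+0301 has no precomposed code point, so NFC leaves it decomposed
# and reverse moves the accent onto the wrong letter.
my $stubborn          = NFC("q\x{0301}x");
my $reversed_stubborn = reverse $stubborn;              # "x\x{0301}q"

print "composed: $resume -> $reversed_resume\n";
print "not composable: $stubborn -> $reversed_stubborn\n";
```

In the second case the accent ends up on the `x`, which is the same failure as in the question.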

Some of the other answers contain elements that don't work well. Here is a working example tested on Perl 5.12 and 5.14. Failing to set the binmode on STDOUT makes print emit "Wide character" warnings. Using a positive lookahead assertion (and no separator retention mode) in split produces incorrect output on my Macbook.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";