Convert a word's characters into its ascii code list concisely in Raku

Question

I'm trying to convert the word wall into its ascii code list (119, 97, 108, 108) like this:

my @ascii="abcdefghijklmnopqrstuvwxyz";

my @tmp;
map { push @tmp, $_.ord if $_.ord == @ascii.comb.any.ord }, "wall".comb;
say @tmp;

Is there a way to use the @tmp without declaring it in a separate line?
Is there a way to produce the ascii code list in one line instead of 3 lines? If so, how to do it?

Note that I have to use the @ascii variable i.e. I can't make use of the consecutively increasing ascii sequence (97, 98, 99 ... 122) because I plan to use this code for non-ascii languages too.

Are you aware of the [ords](https://docs.raku.org/type/Str#method_ords) method? — Elizabeth Mattijsen, Sep 07 '20 at 22:59
I actually used `ords` by accidentally adding an *s*, and got undesired results until I realized what I had typed. To be honest, I haven't looked it up in the documentation, but I guess it returns an array of the `ord` values. — Lars Malmsteen, Sep 09 '20 at 16:37

score 10 · Accepted Answer · answered Sep 07 '20 at 19:47

There are a couple of things we can do here to make it work.

First, let's tackle the @ascii variable. The @ sigil indicates a positional variable, but you assigned a single string to it. This creates a 1-element array ['abc...'], which will cause problems down the road. Depending on how general you need this to be, I'd recommend either creating the array directly:

my @ascii = <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my @ascii = 'a' .. 'z';
my @ascii = 'abcdefghijklmnopqrstuvwxyz'.comb;

or going ahead and handling the any part:

my $ascii-char = any <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my $ascii-char = any 'a' .. 'z';
my $ascii-char = 'abcdefghijklmnopqrstuvwxyz'.comb.any;

Here I've used the $ sigil, because any really specifies any single value, and so will function as such (which also makes our life easier). I'd personally use $ascii, but I'm using a separate name to make later examples more distinguishable.
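To make the difference concrete, here is a minimal sketch (the names @one and @chars are only for illustration) showing that assigning a string to an array gives one element, while .comb gives one element per character, and that a junction held in a scalar can be compared like a single value:

```raku
# Assigning one string to an array stores it as a single element.
my @one   = 'abcdefghijklmnopqrstuvwxyz';
my @chars = 'abcdefghijklmnopqrstuvwxyz'.comb;   # split into characters
say @one.elems;     # 1
say @chars.elems;   # 26

# A junction stored in a scalar compares against every letter at once.
my $ascii-char = @chars.any;
say so 'w' eq $ascii-char;   # True
say so 'á' eq $ascii-char;   # False
```

The `so` collapses the junction of comparison results into a single Bool, which is what `if` does implicitly.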

Now we can handle the map function. Based on the above two versions of ascii, we can rewrite your map function to either of the following:

{ push @tmp, $_.ord if $_ eq @ascii.any  }
{ push @tmp, $_.ord if $_ eq $ascii-char }

Note that if you prefer to use ==, you can go ahead and create the numeric values in the initial ascii creation, and then use $_.ord. As well, personally, I like to name the mapped variable, e.g.:

{ push @tmp, $^char.ord if $^char eq @ascii.any  }
{ push @tmp, $^char.ord if $^char eq $ascii-char }

where $^foo replaces $_ (if you use more than one, they map in alphabetical order to @_[0], @_[1], etc.).
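A short sketch of that alphabetical mapping (the names &joined and @sums are made up for the example): the order in which the placeholders appear in the block does not matter, only their names do.

```raku
# Placeholders are bound in alphabetical order, not in order of appearance.
my &joined = { $^b ~ '-' ~ $^a };
say joined('x', 'y');                    # y-x

# Two placeholders give the block arity 2, so map takes two items per call.
my @sums = (1, 2, 3, 4).map({ $^a + $^b });
say @sums;                               # [3 7]
```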

But let's get to the more interesting question here. How can we do all of this without needing to predeclare @tmp? Obviously, that just requires creating the array in the map loop. You might think that might be tricky for when we don't have an ASCII value, but the fact that an if statement returns Empty (or () ) if it's not run makes life really easy:

my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wall".comb;
my @tmp = map { $^char.ord if $^char eq @ascii.any  }, "wall".comb;

If we used "wáll", the list collected by map would be 119, Empty, 108, 108, which is automagically returned as 119, 108, 108. Consequently, @tmp is set to just 119, 108, 108.
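Here is that "wáll" case as a runnable sketch, together with the numeric == variant mentioned earlier. The names $ascii-ord and @tmp-num are assumptions introduced only for this example.

```raku
my $ascii-char = ('a' .. 'z').any;

# The á fails the test, so the block yields Empty for it and map drops it.
my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wáll".comb;
say @tmp;       # [119 108 108]

# Numeric variant: build the junction from code points and compare with ==.
my $ascii-ord = ('a' .. 'z').map(*.ord).any;
my @tmp-num = map { .ord if .ord == $ascii-ord }, "wáll".comb;
say @tmp-num;   # [119 108 108]
```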

Thank you for the answer. Me writing `@ascii` instead of `$ascii` is a typo but the compiler has not thrown any error for it so it ended up being there. — Lars Malmsteen, Sep 07 '20 at 20:18
Making use of the `if` statement's returning `Empty` or `()` is quite practical. — Lars Malmsteen, Sep 07 '20 at 20:39
@LarsMalmsteen I'm fairly certain allowing exactly what you're doing in `map` (or similarly in `for` loops) is why they do that :-) — user0721090601, Sep 07 '20 at 20:57

score 8 · Answer 2 · answered Sep 08 '20 at 14:58

Yes, there is a much simpler way.

"wall".ords.grep('az'.ords.minmax);

Of course this relies on a to z being an unbroken sequence. This is because minmax creates a Range object based on the minimum and maximum value in the list.

If they weren't in an unbroken sequence you could use a junction.

"wall".ords.grep( 'az'.ords.minmax | 'AZ'.ords.minmax );
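As a quick check of both forms, this sketch prints the Range that minmax builds and then filters two sample strings. The strings "wáll" and "Wall-E" and the variable names are just illustrative choices.

```raku
my $lower = 'az'.ords.minmax;
say $lower;                            # 97..122
say "wáll".ords.grep($lower);          # (119 108 108)

# A junction of two ranges accepts a code point that falls in either one.
my $letters = 'az'.ords.minmax | 'AZ'.ords.minmax;
say "Wall-E".ords.grep($letters);      # (87 97 108 108 69)
```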

But you said that you want to match other languages. Which to me screams regex.

"wall".comb.grep( /^ <:Ll> & <:ascii> $/ ).map( *.ord )

This matches Lowercase Letters that are also in ASCII.

Actually we can make it even simpler. comb can take a regex which determines which characters it takes from the input.

"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )
# (119, 97, 108, 108)

"ΓΔαβγδε".comb( / <:Ll> & <:Greek> / ).map( *.ord )
# (945, 946, 947, 948, 949)
# Does not include Γ or Δ, as they are not lowercase

Note that the above only works with ASCII if you don't have a combining accent.

"de\c[COMBINING ACUTE ACCENT]f".comb( / <:Ll> & <:ascii> / )
# ("d", "f")

The Combining Acute Accent combines with the e which composes to Latin Small Letter E With Acute. That composed character is not in ASCII so it is skipped.

It gets even weirder if there isn't a composed value for the character.

"f\c[COMBINING ACUTE ACCENT]".comb( / <:Ll> & <:ascii> / )
# ("f́",)

That is because the f is lowercase and in ASCII. The combining codepoint gets brought along for the ride though.

Basically, if your data has, or can have, combining accents, and if that could break things, then you are better off dealing with it while it is still in binary form.

$buf.grep: {
    .uniprop() eq 'Ll' # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}
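Since $buf is not defined above, here is one possible reading as a sketch: take the decomposed (NFD) code points of the string as a hypothetical stand-in for $buf, so the base letter and the combining accent are separate integers before filtering. The names @codepoints and @kept are assumptions for this example.

```raku
my $s = "de\c[COMBINING ACUTE ACCENT]f";
say $s.ords;                 # (100 233 102), e + accent composed into é

# Hypothetical stand-in for $buf: the NFD code points of the string.
my @codepoints = $s.NFD.list;
say @codepoints;             # [100 101 769 102]

my @kept = @codepoints.grep: {
    .uniprop() eq 'Ll'                      # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin'   # ASCII block
};
say @kept;                   # [100 101 102]
```

Working on integers this way keeps the e and silently drops the accent, which may or may not be what you want.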

The above would also work for single character strings because .uniprop works on either integers representing a codepoint, or on the actual character.

"wall".comb.grep: {
    .uniprop() eq 'Ll' # Lowercase Letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}

Note again that this would have the same issues with combining codepoints since it works with strings.

You may also want to use .uniprop('Script') instead of .uniprop('Block') depending on what you want to do.
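A small sketch of how the two properties differ: Block is about where the code point sits in the Unicode tables, while Script is about the writing system, so an accented Latin letter fails the 'Basic Latin' block test but still has the Latin script. The sample string "café" is only an illustration.

```raku
say 'a'.uniprop('Block');                    # Basic Latin
say 'é'.uniprop('Block') eq 'Basic Latin';   # False
say 'a'.uniprop('Script');                   # Latin
say 'é'.uniprop('Script');                   # Latin
say 'α'.uniprop('Script');                   # Greek

# Filtering by Script keeps the accented letter as well.
say "café".comb.grep({ .uniprop('Script') eq 'Latin' }).map(*.ord);   # (99 97 102 233)
```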

Come to think of it, if the characters are in a list, you could just use `.comb: /@ascii/` and add additional groups with a `|`: `.comb: /@ascii | @non-ascii/` — user0721090601, Sep 08 '20 at 17:10
The `"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )` is both short and comprehensible. — Lars Malmsteen, Sep 11 '20 at 19:08

score 4 · Answer 3 · jubilatious1 · 2021-05-17T18:23:07.860

Here's a working approach using Raku's trans method (code snippet performed in the Raku REPL):

> my @a = "wall".comb;
[w a l l]
> @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108

Above, we handle an ascii string. Below I add the "é" character, and show a 2-step solution:

> my @a = "wallé".comb;
[w a l l é]
> my @b = @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') );
[119 97 108 108 é]
> @b.trans("é" => ords("é")).put
119 97 108 108 233

Nota bene #1: Although all the code above works fine, when I tried shortening the alphabet to 'a'..'z' I ended up seeing erroneous return values...hence the use of the full 'abcdefghijklmnopqrstuvwxyz'.

Nota bene #2: One question in my mind is trying to suppress output when trans fails to recognize a character (e.g. how to suppress assignment of "é" as the last element of @b in the second-example code above). I've tried adding the :delete argument to trans, but no luck.

EDITED: To remove unwanted characters, here's code using grep (à la @Brad Gilbert), followed by trans:

> my @a = "wallé".comb;
[w a l l é]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108
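If the quirks of trans get in the way, a plain Hash can act as the lookup table instead. This is a swapped-in alternative rather than a use of trans, sketched under the assumption that the alphabet is given as one string. The names $alphabet, %code and @codes are made up for the example, and unknown characters are simply dropped.

```raku
my $alphabet = 'abcçdefgğhıijklmnoöprsştuüvy';

# Build a character => code point table from the alphabet string.
my %code = $alphabet.comb.map(-> $c { $c => $c.ord });

# Characters missing from the table yield Empty, so map drops them.
my @codes = "ağrı!".comb.map(-> $c { %code{$c} // Empty });
say @codes;   # [97 287 114 305]
```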

answered Sep 09 '20 at 00:57 · edited May 17 '21 at 18:23

At first glance `trans` *should* work for this SO. Unfortunately its detailed behavior is baroque, as you've encountered. I did too in 2018 and then documented it in [this gist](https://gist.github.com/raiph/a9d58825662b5cf2da2cc550cb3c6989). I'm hopeful someone (perhaps me, but thus far I've found I do my best Raku core work as an explorer and QA person rather than core dev) will champion redoing `trans` for `raku.f`. My gist is an attempt to help toward that. I hope to have time later today to explore your two NBs. For now, #1 => "crazier stuff"? #2 => "Replacers extension" or "`:delete`"? – raiph Sep 09 '20 at 09:55
@raiph what is stopping you turning this into a module? :-) – Elizabeth Mattijsen Sep 09 '20 at 10:35
@raiph Thank you for your comment. I edited to clarify that all the above code works, I just [NB1] cannot shorten the alphabet into the range `'a'..'z'` without incurring erroneous return values. Also [NB2] it seems certain from the Raku Docs that `:delete` _should_ work. I see this as an important question, for example, when converting a string/array to ascii (numeric) values when string/array already contains numerics. – jubilatious1 Sep 09 '20 at 10:56
@ElizabethMattijsen 1. `trans` as it stands has what I need, and think most sane people would need, provided the bugs and bizarre undocumented behaviors are avoided. 2. When I looked at the relevant compiler code it was way, way too complex for me. 3. When I looked at the spec, aka roast tests, *they* were way too complex too. 4. As a weak swimmer, I don't want to dive off into what I consider a deep end. 5. I don't have strong opinions on what `trans` behaviour should be, just that it would be best if it were not as bizarre and undocumented as it currently is (or was last time I explored it). – raiph Sep 09 '20 at 13:16
"the above code works" I didn't mean to imply otherwise. I was focused on the NBs, trying to convey that doing the sorts of things you suggested in the two NBs should work. "I see this [NB2] as an important question". I think both your NBs are important, as are others. I think there's a good chance that both your NBs are explained by what I documented in my gist. I hope to have a decent block of time to focus on your NBs later today. – raiph Sep 09 '20 at 13:22
@jubilatious1 I just took a brief gander. It looks to me like your NB#1 is covered in the early section in my gist **Using a single string as a matcher...** That is to say, as it explains, you should be able to use `'a..z'` (not `'a'..'z'`). I think your NB#2 is covered in the **Replacers extension** section. That is to say, I think you just need to append the `é` to the LHS matcher string (as well as specifying `:delete`). My gist should explain (to me at least!) how to arrive at more general and complex `trans` solutions as well as understanding and avoiding the more bizarre behaviors. – raiph Sep 09 '20 at 13:53
@raiph Time for an extended conversation elsewhere? I really prefer to stretch out in an email format. Feel free to start a thread on Perl6-Users, and put the NNTP link up here (for those who want to follow). – jubilatious1 Sep 09 '20 at 17:15
I didn't know about the `trans` method. It seems fit for the purpose of this post. It acts kind of like a lookup table. – Lars Malmsteen Sep 11 '20 at 19:25
@LarsMalmsteen I understood from your post that you wanted to use the same code for non-ascii languages, hence you couldn't use the consecutively increasing ascii sequence (97, 98, 99 ... 122). Do you anticipate using this code for text containing two-or-more languages? That was my concern in the third code snippet. – jubilatious1 Sep 11 '20 at 19:36
@jubilatious1 Even the first snippet works quite fine. For Turkish language for instance `my $tr = 'abcçdefgğhıijklmnoöprsştuüvy';` `"ağrı".comb.trans($tr => ords($tr)).put;` it gives the desired output `97 287 114 305` so I really didn't need to try or use the next ones. – Lars Malmsteen Sep 11 '20 at 19:49
@LarsMalmsteen Good to hear! Glad the code is useful to you. – jubilatious1 Sep 12 '20 at 18:19
@jubilatious1 I only just spotted your comments on my gist. I've replied to your comments on the gist. The final line is `put "wallé" .comb .trans: :delete, ('a'..'z', 'é') .flat => ('a'..'z') .join .ords; # 119 97 108 108`. – raiph Nov 28 '20 at 01:29
