I want to print the 95 printable ASCII characters unchanged, but print every other character as its code. How can I do this in pure Perl? With the 'unpack' function? Is there a module for it?

print BackSlashed('test folder'); # expected test\040folder

print BackSlashed('test тестовая folder'); 
# expected test\040\321\202\320\265\321\201\321\202\320\276\320\262\320\260\321\217\040folder

print BackSlashed('НОВАЯ ПАПКА');
# expected \320\235\320\236\320\222\320\220\320\257\040\320\237\320\220\320\237\320\232\320\220

sub BackSlashed() {
my $str = shift;
# ... backslashing code goes here ...
return $str;
}
Anton Shevtsov
  • Note that you are telling Perl that your sub `BackSlashed` takes no arguments. The parentheses `()` in `sub BackSlashed()` are a prototype that says it shouldn't take any args. Since that's not what you want, remove the parens. – simbabque Jan 18 '17 at 11:12

1 Answer

You can use a regular expression substitution whose replacement part is evaluated as code (the /e modifier). In there, you need to convert each character to its numeric value first, and then output it in octal notation. There's a good explanation for it in this answer. Prepend an escaped backslash \ to get it to show up in the output.

$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;

I limited the characters left unescaped to basic ASCII letters and digits. If you want something else, just change the character class.
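
As a minimal sketch of what the replacement expression does for a single character (the variable names here are only illustrative), `ord` returns the numeric value and `sprintf` with `%03o` formats it as three-digit octal:

```perl
use strict;
use warnings;

# One character, step by step.
my $char    = ' ';
my $code    = ord($char);                  # numeric value of the character
my $escaped = sprintf "\\%03o", $code;     # backslash + zero-padded octal
print "$escaped\n";                        # \040

# The same steps applied to a whole string by the substitution.
my $str = 'test folder';
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
print "$str\n";                            # test\040folder
```

The `/e` modifier makes Perl run the replacement as code for every match, and `/g` repeats it for all matches in the string.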


Since your sample output has octets but you said your code has the use utf8 pragma, you need to convert Perl's representation of the string to the corresponding octet sequence before you run the substitution.

use utf8;
my $str = 'НОВАЯ ПАПКА';
print foo($str);

sub foo { # note that there are no () here!
    my $str = shift;
    utf8::encode($str);
    $str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
    return $str;
}
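
To see why the encoding step matters, here is a small sketch (variable names are illustrative) that escapes the single character U+041D both ways. Without encoding, `ord` sees one code point; after `utf8::encode`, it sees the two UTF-8 octets:

```perl
use strict;
use warnings;

my $char = "\x{41D}";    # U+041D CYRILLIC CAPITAL LETTER EN, as a character string

# Without encoding: one character, so one (large) code point.
my $by_codepoint = sprintf "\\%o", ord($char);

# With encoding: the string becomes two octets, each escaped on its own.
my $bytes = $char;
utf8::encode($bytes);
my $by_octet = join '', map { sprintf("\\%03o", ord($_)) } split //, $bytes;

print "$by_codepoint\n";    # \2035
print "$by_octet\n";        # \320\235
```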
simbabque
  • `my $str='НОВАЯ ПАПКА'; $str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg; print $str; #output is: \2035\2036\2022\2020\2057\040\2037\2020\2037\2032\2020` – Anton Shevtsov Jan 18 '17 at 10:53
  • @AntonShevtsov do you have `use utf8`? If I remove that from my code I get your expected output. With it, I get the four-digit numbers. – simbabque Jan 18 '17 at 11:02
  • Everything is UTF-8, Russian. I commented out 'use utf8' and the script works fine. But I need 'use utf8' for my source... – Anton Shevtsov Jan 18 '17 at 11:02
  • I got it working with `use utf8; my $str='НОВАЯ ПАПКА'; utf8::encode($str); $str =~ s/([^a-zA-Z0-9])/sprintf "\\%o", ord($1)/eg; print $str;` – Anton Shevtsov Jan 18 '17 at 11:08
  • @Anton I'm updating the answer with exactly what you did. I got it at the same time :) – simbabque Jan 18 '17 at 11:08
  • The code points are not so high (don't forget that the values are in octal). For example, `\2035` is `41D` in hexadecimal (that is the code point for `U+041D/Н/CYRILLIC CAPITAL LETTER EN`) – Casimir et Hippolyte Jan 18 '17 at 12:14
  • ASCII has 128 characters. `[^\P{ASCII}\P{Print}]` probably matches only "the 95 ASCII characters" mentioned by the OP. But that doesn't escape `\ ` as needed, so `[^\P{ASCII}\P{Print}\\]` would be better. They might also want spaces to be escaped, which can be done as follows: `[^\P{ASCII}\P{Print}\\ ]` – ikegami Jan 18 '17 at 16:38
  • You're doing things in the wrong order, but it works anyway, and it's faster to do it in the order you are doing it. The same could not necessarily be said if an encoding other than UTF-8 was used. With another encoding, you might have to do it in the proper order: `s/.../ join '', map sprintf("\\%03o", $_), split //, encode($enc, $1)/eg`. – ikegami Jan 18 '17 at 16:41
  • @ikegami I don't understand the last part. Can you clarify that please? I think you have a typo in there somewhere that mixes things up. – simbabque Jan 18 '17 at 16:43
  • Reworded the comment for you. – ikegami Jan 18 '17 at 16:44
  • @ikegami But then why is it the wrong way around? Besides being slower, what would matching it first do right? – simbabque Jan 18 '17 at 16:46
  • Because encoding could produce bytes in 00..7F, so you could end up not escaping bytes that should be escaped. Take for example UTF-16le with the input `"\x{4123}"`. If you pre-encode, you'd end up with `\043A` instead of `\043\101` – ikegami Jan 18 '17 at 16:47
  • @ikegami ah, now I get it. So the encode should be done in the substitute, or the whole thing should be broken up into a loop? – simbabque Jan 18 '17 at 16:50
  • Yes, but you don't need to with UTF-8. It's more efficient to do it the way you did. – ikegami Jan 18 '17 at 16:52
  • No. You can't use `return` that way, and your `encode` is misplaced. Remember, we want to escape the encoded bytes. See my earlier comment for actual code. – ikegami Jan 18 '17 at 16:54
  • I think your code is missing the `ord` @ikegami. Do you want to edit that into the answer, or maybe write up your own one? – simbabque Jan 18 '17 at 17:09
  • Yes, missing `ord`. And no, it was just an offhand comment: More care needed for encodings other than UTF-8. – ikegami Jan 18 '17 at 19:17
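
Putting the points from the comments together, one possible sketch of the "match first, then encode" order follows. The helper name `backslash_escape` and its `$enc` parameter are hypothetical, and the explicit range `[^\x21-\x5B\x5D-\x7E]` is an assumption standing in for the property-based class suggested above: it leaves printable ASCII alone except for space and backslash, which get escaped along with everything else. The missing `ord` is included.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape every character that is not printable ASCII
# (space and backslash are escaped as well). Each matched character is
# encoded to $enc inside the substitution, then escaped byte by byte.
sub backslash_escape {
    my ($str, $enc) = @_;
    $str =~ s{([^\x21-\x5B\x5D-\x7E])}{
        my $bytes = encode($enc, my $copy = $1);
        join '', map { sprintf("\\%03o", ord($_)) } split //, $bytes;
    }eg;
    return $str;
}

print backslash_escape("\x{4123}", 'UTF-16LE'), "\n";    # \043\101
print backslash_escape("a b\\c", 'UTF-8'), "\n";         # a\040b\134c
```

Because only the matched characters are encoded, a byte such as 0x41 produced by UTF-16LE is still escaped instead of slipping through as a literal `A`. For UTF-8 the result is the same as encoding the whole string first, which is the faster route taken in the answer.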