We have a bash script running in production. Occasionally its output, which is sent elsewhere to be rendered, contains control characters.

Is there any way, using tr/awk/sed or anything else, to convert control characters in the range 0x00-0x1f (octal 000-037) to Unicode escapes (\u0000 to \u001f), leaving newline ("\n") untouched?

  • We do not want to use perl (ord) inside the bash script. (Increases cpu usage)
  • We do not want to remove the control characters (makes the output look ugly)

Simple Example:

echo "Hello, this \n is a new line. This \t is a tab"

Should become:

Hello, this
 is a new line. This \u0009 is a tab

Reference:

ASCII table: http://www.asciitable.com/

Control Characters: https://en.wikipedia.org/wiki/Control_character

Div
  • Related: https://stackoverflow.com/questions/28176578/convert-utf-8-unicode-string-to-ascii-unicode-escaped-string – kvantour Jun 28 '19 at 15:28
  • Not the format you're asking and not handling `\n`, but `printf '%q'` does display control characters escaped – Aaron Jun 28 '19 at 15:43
  • @kvantour The solution is in Java. – Div Jun 28 '19 at 15:48
  • @Cyrus I cannot post company code here. – Div Jun 28 '19 at 15:49
  • Unicode is usually expressed in hex. In order to use `sed` or `awk` you would basically have to create a lookup table. In the former, it would be unwieldy and ugly. In the latter not much better. In Perl, it's a dozen lines of code. I'll post a Perl script below. – Dennis Williamson Jun 28 '19 at 22:48

2 Answers

Not sure what your goal is. Replace tab? Why tab and not newline?

echo -e "Hello, this \n is a new line. This \t is a tab" | sed 's/\t/\\u0009/g'
Hello, this
 is a new line. This \u0009 is a tab
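The substitution above handles only tab. To cover the whole 0x01-0x1f range except newline, one option is a small lookup table in awk. This is a minimal sketch, not a tuned solution: the function name `escape_ctrl` is made up for illustration, and NUL (0x00) is deliberately left out because not every awk implementation handles NUL bytes in its input.

```shell
# Hypothetical helper: reads stdin, writes escaped text to stdout.
escape_ctrl() {
    LC_ALL=C awk '
    BEGIN {
        # Map each control byte 0x01-0x1f (except newline, 0x0a)
        # to its \u00XX escape.
        for (i = 1; i < 32; i++)
            if (i != 10)
                map[sprintf("%c", i)] = sprintf("\\u%04x", i)
    }
    {
        out = ""
        n = length($0)
        # Walk the line byte by byte and substitute from the table.
        for (j = 1; j <= n; j++) {
            c = substr($0, j, 1)
            out = out ((c in map) ? map[c] : c)
        }
        print out
    }'
}
```

Used as `printf 'a\tb\n' | escape_ctrl`, this prints `a\u0009b`. It walks every character, so it is likely slower than a single regex substitution on large inputs.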
Jotne
  • We want to replace all control characters EXCEPT newline, and not just tab: everything from 0x00 to 0x1f. So piping through a substitution for a single control character is not the solution. Reference: http://www.asciitable.com/ – Div Jun 28 '19 at 15:24

Here is a Perl script. Other than using a lookup table in another language, it's the most efficient way to do what you want. I think the lookup option would actually be slower because the text would have to be processed character by character.

#!/usr/bin/perl -w

use strict;

while (<>) {
    s{([\x{00}-\x{09}\x{0b}-\x{1f}])}{
        '\u00' . unpack "H*", $1;
    }eg;
    print;
}   

I used unpack here instead of ord. I didn't test their relative performance.

The bracket expression in the substitution includes all the control characters except for newline. I didn't include \x{7f} (DEL) but it could be added.

Example:

$ echo -e "Hello, this \n is a new line with some \001\037\014 stuff. This \t is a tab" | ./scriptname
Hello, this 
 is a new line with some \u0001\u001f\u000c stuff. This \u0009 is a tab

Your echo command is outputting those escapes as literal backslash-t and backslash-n because you didn't use -e to cause those to be interpreted. I assume that you intended to include the -e so that's what I did here.

Dennis Williamson
  • Thanks for the script. Although your solution is 100% correct, there are a lot of restrictions on what my bash script does, and I needed to build this part in bash too. I know that something like `s/([\x{0}-\x{9}\x{b}-\x{1f}])/sprintf("\\u%04x", ord($1))/ge` will also work, but we went with another way. Thanks for your answer. – Div Jul 02 '19 at 19:39
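For readers who also need to stay in bash: the thread does not say which approach was finally used, so the following is only one possible pure-bash sketch. The function name `escape_ctrl_bash` is made up for illustration. It cannot handle NUL (0x00), since bash variables cannot hold NUL bytes, and it performs 30 substitutions per line, so it may be slow on large outputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads stdin line by line, writes escaped text.
escape_ctrl_bash() {
    local line i oct ch esc
    # The "|| [[ -n $line ]]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        for i in {1..9} {11..31}; do        # skip 10 (newline)
            printf -v oct '%03o' "$i"       # e.g. 9 -> 011
            printf -v ch "\\$oct"           # the control character itself
            printf -v esc '\\u%04x' "$i"    # e.g. 9 -> \u0009
            line=${line//"$ch"/"$esc"}      # replace every occurrence
        done
        printf '%s\n' "$line"
    done
}
```

For example, `printf 'a\tb\n' | escape_ctrl_bash` prints `a\u0009b`.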