Removing non-printable characters with sed not working

Question

I am working on AIX and trying to remove non-printable characters from a file. When I view the file in Notepad++ with UTF-8 encoding, the data looks like `in Arizona w/ fiancÃÂÃÂÃÂ`. When I view the file in the Unix shell, I get `^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒`.

I want to replace all of those special characters with spaces, so that my output looks like `in Arizona w/ fianc`.

I tried `sed 's/[^[:print:]]/ /g' file`, but it does not remove those characters. My locales, as listed by `locale -a`, are:

C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US

I even tried `sed -e 's/[^ -~]/ /g'`, and it did not remove the characters either.

I see that other Stack Overflow answers used a UTF-8 locale with GNU sed, and that worked, but I do not have that locale.

Also I am using ksh.

Those characters are still "printable", maybe you meant `[:alpha:]`? — Thor, Sep 25 '18 at 18:37
@Thor I can't see them in the shell; they appear as ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒^ in the Unix shell. Notepad++ shows them when I change the encoding to UTF-8. — Auguster, Sep 25 '18 at 18:40
You need to upload a sample somewhere, e.g. pastebin, otherwise it will be hard to give a useful answer — Thor, Sep 25 '18 at 19:14
@Thor I can't use pastebin; it is blocked on my work network. Is there any other service you recommend? — Auguster, Sep 25 '18 at 19:19
What is your operating system? Do you have a package like glibc-common installed? — Bsquare ℬℬ, Oct 22 '18 at 13:03
Are you sure you tried `sed -e 's/[^ -~]//g' file > newfile`? It must work if the chars you want to remove are outside the SPACE-TILDE char range. Maybe `LANG=C sed -e 's/[^ -~]//g' file > newfile` will work (though it seems redundant). Try `awk '{gsub(/[^ -~]/,"",$0)}1' file > newfile`, too. — Wiktor Stribiżew, Jan 16 '19 at 08:40

caxcaxcoatl · Answer 1 · 2019-12-28T02:22:22.017

Easiest - `strings`

The easiest way to do this is with the `strings` command:

$ cat  /tmp/asdf
in Arizona w/ fiancÃÂÃÂÃÂ
$ strings  /tmp/asdf
in Arizona w/ fianc

The problems with this approach:

It does not use sed.
It starts a new line wherever it finds a non-printable character (that should be fine in your example, since they are all grouped at the end of the line, but it will break lines apart otherwise).

Ugliest - `sed`'s `l` plus `sed` post-processing

Now, if you must use sed, then here's an alternative:

$ sed -n l /tmp/asdf | sed -E 's/\\[[:digit:]]{3}//g; s/\$$//'
in Arizona w/ fianc

Here, you're using l to 'dump' non-printable characters, transforming them into octal representations like \303, then removing anything that looks like an octal value so created, and then removing the $ that l added at the end of the line.

It's kind of ugly, and it can damage your data if the file contains a literal backslash followed by three digits, so I'd stay with the `strings` option.
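The two-step pipeline and its backslash caveat can be sketched as a small shell function. The name `strip_via_l` and the sample bytes are made up for illustration, and `LC_ALL=C` is an assumption meant to make `l` treat every high byte as non-printable; none of this has been verified on AIX. Note also that `l` folds long lines, which this sketch does not undo.

```shell
# Hypothetical wrapper around the two-step pipeline from the answer.
# LC_ALL=C is an assumption: it makes sed handle the input byte by byte.
strip_via_l() {
    LC_ALL=C sed -n l "$@" |
        LC_ALL=C sed -E 's/\\[[:digit:]]{3}//g; s/\$$//'
}

# Sample line with four high bytes at the end (illustrative data).
printf 'in Arizona w/ fianc\303\203\303\202\n' | strip_via_l

# The caveat: a literal backslash plus three digits in the data is damaged.
printf '%s\n' 'a\101b' | strip_via_l
```

With GNU sed the second call should print `a\b` rather than `a\101b`, because `l` writes the literal backslash as `\\` and the cleanup pattern then removes `\101`.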

Better - `sed` ranges with high Unicode characters

The one below is also a hack, but it looks better than the rest. It uses a sed range starting with '¡'. I picked that symbol because it is the second* character in the printable upper range of ISO-8859-1 (0xA1), a range that Unicode carries over as the block right after ASCII. So I'm guessing that your trouble is not with actual control codes, but with non-ASCII characters (anything above 127 decimal).
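One way to check that guess is to dump the file's bytes and look for values above 0x7F. The helper name `high_bytes` and the sample input are hypothetical; `od`, `tr`, `grep` and `sort` are standard tools, but this sketch is untested on AIX.

```shell
# Hypothetical helper: print each distinct byte above 0x7F, one hex value per line.
high_bytes() {
    LC_ALL=C od -A n -t x1 -v "$@" |                 # hex dump without offsets
        tr -s ' ' '\n' |                             # one byte per line
        LC_ALL=C grep -E '^[89a-fA-F][0-9a-fA-F]$' | # keep 0x80..0xFF only
        LC_ALL=C sort -u
}

# Sample line with four high bytes at the end (illustrative data).
printf 'fianc\303\203\303\202\n' | high_bytes
```

If the output is empty, the file holds only 7-bit bytes and the problem lies elsewhere, for example in real control codes.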

For the second item in the range, just pick a character from some non-Latin script (Japanese, Chinese, Hebrew, Arabic, etc.), hoping it sits high enough in Unicode that the range includes all of your 'non-printing' characters.

Unfortunately, sed does not have an `[[:ascii:]]` class, nor does it accept open-ended ranges, so you need this hack.

$ sed 's/[¡-ﺏ]/ /g' /tmp/asdf
in Arizona w/ fianc

(*) Note: I picked the second character in the range because the first character is a non-breaking space, so it would be hard to understand that it is not just a normal space.
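Since the question says no UTF-8 locale is installed, a byte-oriented variant of the same range trick may be worth a try: under the C locale each byte counts as one character, so the range can run from byte 0x80 to byte 0xFF. This is a sketch under that assumption and is untested on AIX; `blank_high_bytes` is a made-up name, and the range bytes are generated with `printf` so that no non-ASCII character has to be typed into the script.

```shell
# Hypothetical helper: replace every byte in 0x80..0xFF with a space.
blank_high_bytes() {
    # printf emits the raw bytes 0x80, '-', 0xFF for use inside the brackets
    high_range=$(printf '\200-\377')
    LC_ALL=C sed "s/[$high_range]/ /g" "$@"
}

# Sample line with four high bytes at the end (illustrative data).
printf 'in Arizona w/ fianc\303\203\303\202\n' | blank_high_bytes
```

Unlike the character range above, this works per byte, so a two-byte character turns into two spaces.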
