Strip "unusual" unicode characters
In the comments you mention that you want to strip control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which offers unicode support: its [[:alpha:]] class also matches alphabetic characters outside ASCII. You first need to set LC_CTYPE to specify which characters fall into the [[:alpha:]] range. For German with umlauts, that's e.g.

LC_CTYPE=de_DE.UTF-8

Then you can use sed to strip out everything that is not a letter or punctuation:
sed 's/[^[:alpha:];\ -@]//g' < junk.txt
What \ -@ does: it matches all characters in the ASCII range between space and @ (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -@ is needed.
You may need to play around a little with LC_CTYPE: with it set to a UTF-8 locale I could match Greek characters, but not Japanese.
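As a quick sanity check, here is the command run against a small hypothetical junk.txt (assuming a UTF-8 locale such as en_US.UTF-8 is installed on your system):

```shell
# Create a sample file: ASCII text, an embedded control character (octal \001),
# and the Greek letters αβγ (UTF-8 bytes \316\261 \316\262 \316\263)
printf 'abc\001 \316\261\316\262\316\263!\n' > junk.txt

# With a UTF-8 locale, [[:alpha:]] also matches the Greek letters,
# so only the control character is removed; space and ! survive
# because they fall in the \ -@ range
LC_CTYPE=en_US.UTF-8 sed 's/[^[:alpha:];\ -@]//g' < junk.txt
```

If the output still shows the Greek letters but not the control character, your locale is set up correctly.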
If you only care about ascii
If you only care about regular ASCII characters, you can use tr. First convert the file to a "one byte per character" encoding, since tr does not understand multibyte characters, e.g. using iconv. Then I'd advise a whitelist approach (as opposed to the blacklist approach in your question), as it's a lot easier to state what you want to keep than what you want to filter out.
This command should do it:
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
This pipeline:

- converts to latin1 (one byte per character); the -c flag silently discards any character that has no latin1 representation
- strips away every byte outside the whitelist \11\12\40-\176. The numbers are octal (have a look at an ASCII table): \11 is tab, \12 is newline (line feed), and \40-\176 is the range of printable characters from space to ~, i.e. what is commonly considered "normal". Be aware that this step also removes things like umlauts or other accented characters in your language which you might want to keep!
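For instance, with a hypothetical junk.txt containing umlaut-style accented characters and a control byte:

```shell
# Sample input: ü and ß are outside ASCII, \001 is a control character
printf 'F\303\274\303\237e ok\001!\n' > junk.txt

# ü and ß survive the latin1 conversion but are then dropped by tr
# (their latin1 bytes are above \176), as is the control character;
# plain ASCII passes through untouched
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
# prints: Fe ok!
```

Note that the non-ASCII letters are silently dropped rather than transliterated; if you'd rather approximate them (ü to u, etc.), iconv's //TRANSLIT target may help, depending on your iconv implementation.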