Strip "unusual" unicode characters
In the comments you mention that you want to strip control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which offers unicode support: its [[:alpha:]] class also matches alphabetic characters outside ASCII. You first need to set LC_CTYPE to specify which characters fall into the [[:alpha:]] range. For German with umlauts, that's e.g.

LC_CTYPE=de_DE.UTF-8

Then you can use sed to strip out everything that is not a letter or punctuation:
sed 's/[^[:alpha:];\ -@]//g' < junk.txt
What \ -@ does: it matches all characters in the ASCII range between space and @ (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -@ is needed.
You may need to play around a little with LC_CTYPE: with it set to a UTF-8 locale I could match Greek characters, but not Japanese.
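As a quick sanity check, here is the command run against a small hypothetical junk.txt (assuming a UTF-8 locale such as en_US.UTF-8 is installed on your system):

```shell
# Create a sample file: ASCII text, an embedded control character (octal \001),
# and the Greek letters αβγ (UTF-8 bytes \316\261 \316\262 \316\263)
printf 'abc\001 \316\261\316\262\316\263!\n' > junk.txt

# With a UTF-8 locale, [[:alpha:]] also matches the Greek letters,
# so only the control character is removed; space and ! survive
# because they fall in the \ -@ range
LC_CTYPE=en_US.UTF-8 sed 's/[^[:alpha:];\ -@]//g' < junk.txt
```

If the output still shows the Greek letters but not the control character, your locale is set up correctly.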
If you only care about ascii
If you only care about regular ASCII characters, you can use tr. First convert the file to a "one byte per character" encoding, since tr does not understand multibyte characters, e.g. using iconv. Then I'd advise a whitelist approach (as opposed to the blacklist approach in your question), as it's a lot easier to state what you want to keep than what you want to filter out.
This command should do it:
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
This pipeline:

- converts to latin1 (one byte per character); the -c flag silently discards any character that has no latin1 representation
- strips away every byte outside the whitelist \11\12\40-\176. The numbers are octal (have a look at an ASCII table): \11 is tab, \12 is newline (line feed), and \40-\176 is the range of printable characters from space to ~, i.e. what is commonly considered "normal". Be aware that this step also removes things like umlauts or other accented characters in your language which you might want to keep!
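For instance, with a hypothetical junk.txt containing umlaut-style accented characters and a control byte:

```shell
# Sample input: ü and ß are outside ASCII, \001 is a control character
printf 'F\303\274\303\237e ok\001!\n' > junk.txt

# ü and ß survive the latin1 conversion but are then dropped by tr
# (their latin1 bytes are above \176), as is the control character;
# plain ASCII passes through untouched
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
# prints: Fe ok!
```

Note that the non-ASCII letters are silently dropped rather than transliterated; if you'd rather approximate them (ü to u, etc.), iconv's //TRANSLIT target may help, depending on your iconv implementation.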