30

I want to delete all the control characters from my file using Linux Bash commands.

Some control characters, especially 0x1A (SUB, used as an EOF marker on DOS/Windows), cause problems when I load my file in other software. I want to delete them.

Here is what I have tried so far:

This shows the control characters in the first lines of the file:

cat -v -e -t file.txt | head -n 10

^A+^X$
^A1^X$
^D ^_$
^E-^D$
^E-^S$
^E1^V$
^F%^_$
^F-^D$
^F.^_$
^F/^_$
^F4EZ$
^G%$

This lists the lines containing control characters using grep:

$ cat file.txt | head -n 10 | grep '[[:cntrl:]]'
+
1

-
-
1
%
-
.
/

This matches the output of the cat command above.

Next, I ran the following command to show only the lines without control characters, but it still shows the same output as above (lines with control characters):

$ cat file.txt | head -n 10 | grep '[^[:cntrl:]]'
+
1

-
-
1
%
-
.
/

Here is the output in hex format:

$ cat file.txt | head -n 10 | grep '[[:cntrl:]]' | od -t x2
0000000 2b01 0a18 3101 0a18 2004 0a1f 2d05 0a04
0000020 2d05 0a13 3105 0a16 2506 0a1f 2d06 0a04
0000040 2e06 0a1f 2f06 0a1f
0000050

As you can see, the hex values 0x01 and 0x18 are control characters.

I tried using the tr command to delete the control characters but got an error:

$ cat file.txt | tr -d "\r\n" "[:cntrl:]" >> test.txt
tr: extra operand `[:cntrl:]'
Only one string may be given when deleting without squeezing repeats.
Try `tr --help' for more information.

If I delete all control characters, I will also delete the newline and carriage return, which together form the line ending on Windows. How do I delete all control characters except the ones I need, like "\r\n"?

Thanks.

Neon Flash

4 Answers

31

Instead of using the predefined [:cntrl:] set, which as you observed includes \n and \r, just list (in octal) the control characters you want to get rid of:

$ tr -d '\000-\011\013\014\016-\037' < file.txt > newfile.txt
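As a minimal sketch of the same idea, the variant below keeps tab, line feed and carriage return, and additionally drops DEL (\177), which the command above leaves alone. The function name strip_ctrl and the sample bytes are hypothetical, chosen only for illustration; LC_ALL=C is set on the assumption that the file should be treated as raw bytes.

```shell
# Hypothetical helper: delete ASCII control bytes from stdin, keeping
# tab (\011), line feed (\012) and carriage return (\015).
# \177 (DEL) is an addition that is not in the command above.
strip_ctrl() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037\177'
}

# Sample input: SOH "+" CAN CR LF, then "a" TAB "b" SUB(0x1A) LF.
# od -c shows which bytes survive the filter.
printf '\001+\030\r\na\tb\032\n' | strip_ctrl | od -c
```

To clean a real file, something like `strip_ctrl < file.txt > newfile.txt` should work; writing to a new file avoids truncating the input while it is still being read.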
Kyle Barbour
    Note that this removes the tab character `\t`. Use `tr -d '\000-\010\013\014\016-\037' < file.txt > newfile.txt` to change that – phiresky Jul 10 '18 at 01:04
13

Based on this answer on unix.stackexchange, this should do the trick:

$ cat scriptfile.raw | col -b > scriptfile.clean
KenHBS
Stephen Boston
11

Try grep, like:

grep -o "[[:print:][:space:]]*" in.txt > out.txt

which prints only printable characters (letters, digits, punctuation, and space) plus whitespace characters such as tab, newline, vertical tab, form feed, and carriage return.

To be less restrictive and remove only control characters ([:cntrl:]), delete them with:

tr -d "[:cntrl:]"

If you want to keep \r and \n (which are part of [:cntrl:]), temporarily replace them with something else, e.g.:

cat file.txt | tr '\r\n' '\275\276' | tr -d "[:cntrl:]" | tr "\275\276" "\r\n"
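The placeholder swap only round-trips cleanly if the bytes \275 and \276 do not already occur in the input. A rough sketch of the same pipeline with a guard for that case follows; strip_ctrl_swap is a hypothetical name, and LC_ALL=C is an assumption so that tr works on raw bytes.

```shell
# Hypothetical wrapper around the placeholder swap shown above.
# It refuses to run when the placeholder bytes already occur in the file,
# because those bytes would be turned into \r or \n on the way back.
strip_ctrl_swap() {
  src=$1
  # Count the bytes that equal one of the two placeholders.
  if [ "$(LC_ALL=C tr -cd '\275\276' < "$src" | wc -c)" -gt 0 ]; then
    echo "placeholder bytes already present in $src" >&2
    return 1
  fi
  LC_ALL=C tr '\r\n' '\275\276' < "$src" |
    LC_ALL=C tr -d '[:cntrl:]' |
    LC_ALL=C tr '\275\276' '\r\n'
}
```

A call such as `strip_ctrl_swap file.txt > out.txt` writes the cleaned bytes to out.txt, or prints a message and returns 1 without producing output when the guard trips.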
kenorb
  • Note that using grep will add a newline to the end of the file even if that wasn't there before. – Oskar Berggren Jan 06 '19 at 11:46
  • Why '\r\n'? Isn't '\n' enough? – Adrian Maire May 29 '20 at 09:47
  • @AdrianMaire I believe it's for maximum compatibility across all OSes. – Hashim Aziz Dec 01 '20 at 23:44
  • Also try `tr` with complement on: `tr -d -c '[[:print:][:space:]]'`. As an aside, I recommend using `printf '%q' "$x"`, `${(q)x}`, etc. (escaped views) or `hd <<<"$x"` / `printf '%s' "$x" | hd` (hex views per `hexdump`) rather than trying to examine and compare strings with `\r`, etc. (I immediately confused myself testing this.) – John P Dec 11 '20 at 09:45
  • This turns characters like ½ into newlines. – Tyler V. Mar 02 '21 at 00:42
5

A little late to the party: cat -v <file>, which I think is the easiest of the lot to remember!
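Note that cat -v only makes control characters visible in caret notation; it does not delete them. It is handy for inspecting a file before and after cleanup. The sketch below pairs it with a small check; has_ctrl is a hypothetical helper name, and treating tab, LF and CR as allowed is an assumption taken from the question.

```shell
# cat -v renders control bytes in caret notation: 0x01 as ^A, 0x18 as ^X.
printf '\001+\030\n' | cat -v   # prints ^A+^X

# Hypothetical helper: succeeds (exit 0) if the file still contains any
# control byte other than tab, line feed or carriage return.
has_ctrl() {
  [ "$(LC_ALL=C tr -d '\t\n\r' < "$1" | LC_ALL=C tr -cd '[:cntrl:]' | wc -c)" -gt 0 ]
}
```

For example, `has_ctrl newfile.txt && echo "still dirty"` gives a quick yes/no check after running one of the tr commands above.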

UKMonkey