How to remove control characters from a text file

Question

I have a file from which I need to remove all control characters, but I cannot understand what is going on.

$ cat -v -e -t values.xml | head -n 10
<?xml version="1.0" encoding="UTF-8"?>^M$
^I^I^I<HHDGSID>1</HHDGSID>^M$
^I^I^I<SEHJJE>1</SEHJJE>^M$
^I^I^I<ADRTYPE>0</ADRTYPE>^M$
^I^I^I<TESTJGHJTE>30/10/2000</TESTJGHJTE>^M$

When I search for [:cntrl:] characters, I get letters like the l in the below row:

<?xml version="1.0" encoding="UTF-8"?>^M$

How should I handle this?

Here is a sample of my file:

<?xml version="1.0" encoding="UTF-8"?>
            <SOME>1</SOME>
            <SOMEEXTRA>1</SOMEEXTRA>
            <ADRTYPE>0</ADRTYPE>
            <SOMEEXTRADATE>30/10/2000</SOMEEXTRADATE>
            <SOMEEXTRACDATE>30/10/2000</SOMEEXTRACDATE>
            <CODE>0</CODE>
            <CEBY>1</CEBY>
        </ORD>

See http://stackoverflow.com/questions/14680100/removing-control-characters-from-a-file — martin clayton, Dec 21 '16 at 11:56
I have updated the question with a link to a sample file. I still get the ^I characters at the beginning of each line... — StevenH, Dec 21 '16 at 11:57
`^I` is the tab character. Remove it with awk like this: `awk '{gsub(/\t/,"")} 1' file`. Since I is the 9th letter of the alphabet, `^I` represents the character with decimal code 9; likewise the bell is `^G`, the 7th, and so on. You can also remove it with `gsub(/\x09/,"")`. — James Brown, Dec 21 '16 at 11:58
I'm pretty sure you should be able to just ignore it, as it's irrelevant to the XML semantics. Just use an XML parser in the first place. — Sobrique, Dec 21 '16 at 12:39
An example with an XML parser? I am not sure I have understood. — StevenH, Dec 21 '16 at 13:04
What you have in the linked file are just tabs. I don't know why you show them as `^I` here (but that does stand for a tab). They count as white space and are handled without trouble by many environments. If you need to do some work with that file, use an XML parser, which will deal with tabs as a matter of course. If you just want to remove them, `perl -pe 's/^\t+//' input > output` will strip all leading tabs. An editor may have replaced tabs with spaces, so the indentation looks the same but contains no tab character. In that case use `s/^\s+//` instead, to remove all leading white space. — zdim, Dec 21 '16 at 18:19
Using an xml parser would mean that you load a module, like [XML::LibXML](http://search.cpan.org/dist/XML-LibXML/LibXML.pod) or [XML::Twig](http://search.cpan.org/~mirod/XML-Twig-3.49/Twig.pm), and use its methods and functions to open and work with a file. See linked docs, and many SO posts. — zdim, Dec 22 '16 at 00:26
It's unclear what you're trying to do and why, and what tool(s) you're using, and what you've tried so far. I'm voting to close this question because "How should I handle this?" can't be answered without knowing what you mean by "this". — melpomene, Dec 24 '16 at 00:46
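
The caret notation discussed in the comments above can be checked directly in a shell. This is only a minimal sketch, assuming an ASCII-compatible locale, a POSIX `printf`, and a `cat` that accepts `-v -e -t` as in the question; the helper name `caret_code` is hypothetical.

```shell
# caret_code LETTER: print the decimal code of the control character ^LETTER.
# Caret notation shows a control character as ^ followed by the character
# whose code is 64 higher, so ^I is code 73 - 64 = 9.
caret_code() {
    # printf '%d' "'I" prints the numeric code of the character I
    code=$(printf '%d' "'$1")
    echo $((code - 64))
}

caret_code I   # tab, prints 9
caret_code M   # carriage return, prints 13
caret_code G   # bell, prints 7

# Display a tab and a CR/LF line ending the same way the question does
printf 'a\tb\r\n' | cat -v -e -t
```

With a `cat` that behaves like the one in the question, the last command should render the tab as `^I`, the carriage return as `^M`, and the end of the line as `$`.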

score 0 · Answer 1 · answered Dec 27 '16 at 17:25

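You could try this Perl loop, which reads the input line by line, deletes the listed control characters, and prints the result:

while (<>) {
   s/\cX//g; # removes ^X's
   s/\cI//g; # removes ^I's
   # ... add further substitutions here as needed
   print;    # write the cleaned line to standard output
}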
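
For comparison, the same clean-up can be sketched in plain shell with `tr`. This is not part of the original answer: it assumes you want to delete every ASCII control character except the newline, and the function name `strip_ctrl` is hypothetical.

```shell
# strip_ctrl: read stdin, delete control characters other than newline,
# and write the result to stdout.
strip_ctrl() {
    # \000-\011 covers NUL through TAB, \013-\037 covers VT through US
    # (which includes the carriage return), and \177 is DEL.
    # \012 (newline) is deliberately left out so the line structure survives.
    LC_ALL=C tr -d '\000-\011\013-\037\177'
}

# One line shaped like the question's data: leading tabs and a CR/LF ending
printf '\t\t\t<ADRTYPE>0</ADRTYPE>\r\n' | strip_ctrl
```

A whole file could then be cleaned with something like `strip_ctrl < values.xml > values.clean.xml`, where the output name is only an example. Note that this also removes the leading tabs, so the indentation is lost.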

ProgAndPlay

score 0 · Answer 2 · answered Jan 14 '19 at 08:22

When I search for [:cntrl:] characters, I get letters like the l in the below row:
<?xml version="1.0" encoding="UTF-8"?>^M$

man 7 regex says:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class.

So, since the [:cntrl:] has to be within a bracket expression, you have to search for [[:cntrl:]].

On its own, [:cntrl:] is just a bracket expression that matches any single character from the list :, c, n, t, r, l; hence it matches the l in <?xml ….
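
The difference can be seen with a small shell sketch, assuming a `sed` that supports POSIX character classes. The bare form is written here as the equivalent explicit set `[:cntrl]`, because some tools (GNU grep, for example) refuse a literal `[:cntrl:]` with an error instead of matching; the variable name `sample` is hypothetical.

```shell
# A shortened XML declaration followed by a carriage return
sample=$(printf '<?xml version="1.0"?>\r')

# Correct form: the class name sits inside a bracket expression,
# so only the carriage return is deleted.
printf '%s\n' "$sample" | sed 's/[[:cntrl:]]//g'

# What the bare [:cntrl:] amounts to: a set of the characters : c n t r l.
# The letters l, r and n are deleted and the carriage return stays in place.
printf '%s\n' "$sample" | sed 's/[:cntrl]//g'
```

The first command leaves the declaration intact apart from the line ending, while the second damages the text (`xml` becomes `xm`, `version` becomes `vesio`) and still leaves the control character behind.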
