I have a file, from which I need to extract all control characters, but I cannot understand what is going on.

$ cat -v -e -t values.xml | head -n 10
<?xml version="1.0" encoding="UTF-8"?>^M$
^I^I^I<HHDGSID>1</HHDGSID>^M$
^I^I^I<SEHJJE>1</SEHJJE>^M$
^I^I^I<ADRTYPE>0</ADRTYPE>^M$
^I^I^I<TESTJGHJTE>30/10/2000</TESTJGHJTE>^M$

When I search for [:cntrl:] characters, I get letters like the l in the below row:

<?xml version="1.0" encoding="UTF-8"?>^M$

How should I handle this?

Here is a sample of my file:

<?xml version="1.0" encoding="UTF-8"?>
            <SOME>1</SOME>
            <SOMEEXTRA>1</SOMEEXTRA>
            <ADRTYPE>0</ADRTYPE>
            <SOMEEXTRADATE>30/10/2000</SOMEEXTRADATE>
            <SOMEEXTRACDATE>30/10/2000</SOMEEXTRACDATE>
            <CODE>0</CODE>
            <CEBY>1</CEBY>
        </ORD>
Armali
StevenH
  • try with `dos2unix inputfile` – P.... Dec 21 '16 at 11:39
  • See http://stackoverflow.com/questions/14680100/removing-control-characters-from-a-file – martin clayton Dec 21 '16 at 11:56
  • i have updated the question, with a link to a sample file i have, i still get the ^I characters in the beginning of each line... – StevenH Dec 21 '16 at 11:57
  • `^I` is the tab. Remove it with awk like: `awk '{gsub(/\t/,"")} 1' file`. Since I is the 9th letter of the alphabet, `^I` represents the character with decimal code 9. The bell is `^G`, the 7th, etc. You can also remove it with `gsub(/\x09/,"")`. – James Brown Dec 21 '16 at 11:58
  • I'm pretty sure you should be able to just ignore it, as it's irrelevant to the XML semantics. Just use an XML parser in the first place. – Sobrique Dec 21 '16 at 12:39
  • an example with an xml parser? i am not sure i have understood – StevenH Dec 21 '16 at 13:04
  • What you have in the linked file are just tabs. I don't know why you show them as `^I` here (but that does stand for a tab). They count as white space and are easily understood by many environments. If you need to do some work with that file, use an XML parser, which will deal with tabs as a matter of course. If you just want to remove them, `perl -pe 's/^\t+//' input > output` will strip all leading tabs. An editor may replace tabs with spaces, so it looks like tabs are there when in fact there is no tab character. In that case you can use `s/^\s+//` instead, to remove all leading white space. – zdim Dec 21 '16 at 18:19
  • Using an xml parser would mean that you load a module, like [XML::LibXML](http://search.cpan.org/dist/XML-LibXML/LibXML.pod) or [XML::Twig](http://search.cpan.org/~mirod/XML-Twig-3.49/Twig.pm), and use its methods and functions to open and work with a file. See linked docs, and many SO posts. – zdim Dec 22 '16 at 00:26
  • It's unclear what you're trying to do and why, and what tool(s) you're using, and what you've tried so far. I'm voting to close this question because "How should I handle this?" can't be answered without knowing what you mean by "this". – melpomene Dec 24 '16 at 00:46
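
As a rough illustration of the caret notation described in the comments above (a sketch, not taken from the thread): `cat -v` shows a control character as `^` plus the letter whose position in the alphabet equals the character's code. The helper name `caret_code` is made up for this example.

```shell
# caret_code is a hypothetical helper: given an uppercase letter, print the
# decimal code of the control character that cat -v displays as ^LETTER.
caret_code() {
    letter_code=$(printf '%d' "'$1")   # ASCII code of the letter, e.g. 73 for I
    echo $((letter_code - 64))         # A is 65, so ^A is 1, ^I is 9, and so on
}

tab_code=$(caret_code I)    # ^I, the tab
cr_code=$(caret_code M)     # ^M, the carriage return
bell_code=$(caret_code G)   # ^G, the bell

echo "^I=$tab_code ^M=$cr_code ^G=$bell_code"
```

So the `^I` at the start of each line is character 9 (tab) and the `^M` before each `$` is character 13 (carriage return), which suggests the file has Windows-style line endings.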

2 Answers

You could try this:

while (<>) {
   s/\cX//g; # removes ^X (Ctrl-X)
   s/\cI//g; # removes ^I (tab)
   # ... further substitutions as needed
   print;    # output the cleaned line
}
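
The same idea can be sketched in shell with `tr`, here assuming that the only characters to delete are the tab (`^I`) and the carriage return (`^M`) visible in the question's `cat -v` output. The sample line is made up to mirror that output.

```shell
# Hypothetical sample line: three leading tabs and a trailing carriage return.
dirty=$(printf '\t\t\t<ADRTYPE>0</ADRTYPE>\r')

# tr -d deletes every occurrence of the listed characters.
clean=$(printf '%s' "$dirty" | tr -d '\t\r')

printf '%s\n' "$clean"
```

On a real file this would look something like `tr -d '\t\r' < values.xml > cleaned.xml`, where `cleaned.xml` is a made-up output name. Note that this deletes every tab, including any that occur inside element content.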
ProgAndPlay

When I search for [:cntrl:] characters, I get letters like the l in the below row:

<?xml version="1.0" encoding="UTF-8"?>^M$

man 7 regex says:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class.

So, since [:cntrl:] has to appear within a bracket expression, you have to search for [[:cntrl:]].

On its own, [:cntrl:] is just a bracket expression that matches any single character from the list :, c, n, t, r, l; hence it matches the l in <?xml ….
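
The difference can be checked with a small sketch. It uses shell `case` patterns rather than a particular regex tool, on the assumption that this is acceptable as a stand-in: POSIX defines bracket expressions in shell patterns by reference to the regular-expression rules quoted above. The `matches` helper and the sample strings are made up for the example.

```shell
# Hypothetical samples: one string without control characters, one ending in a carriage return.
plain='<?xml version="1.0"?>'
line=$(printf '<?xml version="1.0" encoding="UTF-8"?>\r')

# Prints "match" when string $1 matches shell pattern $2, otherwise "no match".
matches() {
    case $1 in
        $2) echo match ;;
        *)  echo "no match" ;;
    esac
}

wrong='*[:cntrl:]*'     # a list of characters: any one of : c n t r l
right='*[[:cntrl:]]*'   # a character class: any control character

wrong_on_plain=$(matches "$plain" "$wrong")   # hits letters such as the l in xml
right_on_plain=$(matches "$plain" "$right")   # nothing to find in plain text
right_on_line=$(matches "$line" "$right")     # finds the carriage return

printf '%s\n' "$wrong_on_plain" "$right_on_plain" "$right_on_line"
```

The single-bracket form reports a match on text that contains no control character at all, which is the behaviour described in the question.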

Armali