
I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command will strip the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more have this behavior in OSes earlier than Windows 8?

kbulgrien
svengineer99
  • There are no commands in CMD/Win8 to do what you are trying to do (which sounds like filtering out non-alphanumeric bytes from a binary file)... Side note: you would be much better off finding a reader for the file format you are trying to crawl through... – Alexei Levenkov Dec 20 '15 at 07:48
  • Please read [How to ask a good question](http://stackoverflow.com/help/how-to-ask) and post a question so we can help you. – Vedda Dec 20 '15 at 07:49
  • I am reporting a method that DOES work. I am using batch for communication with a concurrently running .exe with limited file output options. – svengineer99 Dec 20 '15 at 07:59
  • My question is - will this method work on earlier/later windows OS (Win98, etc)? – svengineer99 Dec 20 '15 at 08:03
  • I've only ever seen control characters get added to a text file like that when you `ftp` a text file and you forget to use `asc` mode. – SomethingDark Dec 20 '15 at 12:24
  • `more` is good to convert Unicode text files to ANSI ones; but it does _not_ surely clean up an ANSI text file; [tag:batch-file] is actually a bad choice for such kind of tasks as there are only quite limited features for file handling... – aschipfl Dec 20 '15 at 12:39
  • Appreciate the replies on this. I understand that batch is generally a bad choice for such tasks. I'm not sure why more works to strip the problematic control characters from the pseudo-text file format I'm trying to operate on, but it does. Therefore it seems a good solution for my specific case. – svengineer99 Dec 20 '15 at 19:03
  • My remaining question is - is it reasonable to expect this method to work the same under Win98, etc? The reason I ask is that it's something I'd like to share with others trying to operate on the same pseudo-text file format but possibly running on different OS. Including set __COMPAT_LAYER=WINXPSP3 still worked OK, but I'm not sure that's a valid way to test it. – svengineer99 Dec 20 '15 at 19:03

2 Answers


Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.
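
As an illustration of filtering selectively, here is a minimal Python sketch that drops control bytes but keeps Tab, line feed, and carriage return. The function name `strip_control_bytes` and the exact set of bytes to keep are assumptions for this example, and it only makes sense for single-byte (ASCII/ANSI) data, not for UTF-16 input.

```python
# Bytes we want to keep even though they are control characters:
# Tab (0x09), line feed (0x0A), carriage return (0x0D).
KEEP = {0x09, 0x0A, 0x0D}

# Every other byte below 0x20, plus DEL (0x7F), is dropped.
DROP = bytes(b for b in range(0x20) if b not in KEEP) + b"\x7f"


def strip_control_bytes(raw: bytes) -> bytes:
    """Return raw with control bytes removed, except Tab, LF and CR."""
    # With a table of None, bytes.translate only deletes the listed bytes.
    return raw.translate(None, DROP)


if __name__ == "__main__":
    sample = b"a\x00b\tc\r\n\x07"
    print(strip_control_bytes(sample))  # b'ab\tc\r\n'
```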

I'm assuming your ^M is a carriage return, and ^@ is a NULL byte. The carriage returns are not causing you problems, and MORE does not remove them. But NULL bytes can cause problems if your utility is expecting ASCII text files.

Your input file is most likely UTF-16. MORE is converting the UTF-16 into ANSI (extended ASCII) format, which does effectively remove the NULL bytes. It also converts non-ASCII values into extended ASCII characters in the decimal 128 - 255 byte value range. I believe it uses your active code page (CHCP) value to figure out what characters map where, but I'm not positive.
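
To see why such a conversion makes the NULL bytes disappear, here is a minimal Python sketch of the same idea. It is not how MORE is implemented. The function name `utf16_to_ansi` is hypothetical, `cp1252` is only a stand-in for whatever your active code page is, and input without a BOM is assumed to be little-endian.

```python
import codecs


def utf16_to_ansi(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes and re-encode them in a single-byte code page."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        text = raw.decode("utf-16")
    else:
        # Assumption: no BOM means little-endian, as is usual on Windows.
        text = raw.decode("utf-16-le")
    # Characters missing from the code page become "?" instead of raising.
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "abc\r\n".encode("utf-16-le")
    print(sample)                 # every ASCII character is followed by \x00
    print(utf16_to_ansi(sample))  # b'abc\r\n'
```

Each ASCII character occupies two bytes in UTF-16, one of which is zero, so the zero bytes vanish once the text is re-encoded with one byte per character.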

You should be aware of some additional issues.

  • MORE will convert all Tab characters into a series of spaces, and you cannot control how many spaces (it varies depending on the current position in the line).

  • MORE will always terminate each line with \r\n (carriage return and line feed).

  • MORE also removes the two byte BOM at the beginning of the file, if it exists. The BOM indicates the UTF-16 format. But MORE does not require the 2 byte BOM indicator, it will convert the UTF-16 to ANSI regardless.

  • Lastly MORE can hang indefinitely if your file exceeds 64K lines.
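
The first point, that the number of spaces depends on the position in the line, can be sketched in Python. This is only an illustration of tab-stop expansion, assuming a tab stop every 8 columns; the function name `expand_tabs` is hypothetical and the code does not claim to reproduce MORE exactly.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each Tab with enough spaces to reach the next tab stop."""
    out = []
    col = 0  # current column, counted from 0
    for ch in line:
        if ch == "\t":
            # Pad up to the next multiple of tab_width.
            pad = tab_width - (col % tab_width)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)


if __name__ == "__main__":
    # The same Tab becomes 7 spaces after "a" but only 4 after "abcd".
    print(repr(expand_tabs("a\tb")))
    print(repr(expand_tabs("abcd\tb")))
```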

If MORE works for you, then by all means use it.

One other option is to use TYPE, which will also convert UTF-16 to ANSI:

type "yourFile.txt" >"newFile.txt"

TYPE definitely maps non-ASCII codes based on the active code page.

There are some differences between how TYPE and MORE convert:

  • One advantage of TYPE is it does not convert Tab characters to spaces.

  • Another advantage is it will not hang with large files.

  • Another difference (maybe good, maybe bad) is it will not add a line terminator to a line that does not already have one.

  • A potential disadvantage of TYPE is it will not convert UTF-16 to ANSI if the input is missing the BOM.

dbenham
  • Thank you for this extremely detailed, helpful and useful answer! My input file is of very specialized and well defined format ( I am actually generating this file from a script I wrote running on another concurrent executable) so I can take care to avoid the potential more issues you kindly documented. If I do run into one of these issues I now know that type could be an improved option. Thank you again for your time and care to answer this question. – svengineer99 Dec 21 '15 at 06:02

Hi, sorry for replying to this old thread but I have seen this question being asked in many places, even several times here. This might as well help other people. I tried the type command as suggested by @dbenham but it did not work.

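On a system with Unix tools, this can be done with cat -v, which writes non-printing characters in visible caret notation (a NULL byte becomes the two literal characters ^@):

cat -v file > newfile

Credit to Roel Van de Paar from YouTube.

You can then remove the literal ^@ sequences from the converted file with sed:

sed 's/\^@//g' newfile > newfile.out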