Replacing non-ASCII characters with letters to keep the file format the same

Question

Here is my code:

byte[] bytes = File.ReadAllBytes(@"D:\project\wb_header.txt");
byte[] outp = bytes.Where(c => c >= 32 && c < 127).ToArray();
File.WriteAllBytes(@"D:\project\outputfile.txt", outp);

This code reads wb_header.txt, removes every non-ASCII (and non-printable) byte, and writes the result to the output file. The problem is that I do not want to remove those characters. I want to replace them with letters or other ASCII characters so that the output keeps the same format as wb_header.txt. How can I do that? Please include some code.

You're completely ignoring the encoding of the source file. This could be problematic, because of multi-byte encodings. What are you **actually** trying to do? — spender, Jan 07 '14 at 21:09
I want to replace the non ascii characters with some ascii values at the output actually. — Debopam, Jan 07 '14 at 21:24

score 3 · Answer 1 · edited May 23 '17 at 10:32

Your code works only if your original file is ASCII. In that case, if you just want to replace non-printable characters (an old-fashioned definition, I know) instead of removing them:

byte[] output = bytes
    // Keep printable ASCII (32..126); 127 is DEL, a control character.
    .Select(c => (c >= 32 && c < 127) ? c : (byte)63).ToArray();

This replaces every non-printable character with ? (question mark, ASCII code 63).
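Since the goal is to keep the file format, it may help to leave tabs and line breaks alone; the filter above turns them into ? as well. The following is a minimal sketch of that variant, not part of the original answer: the class and method names are hypothetical, and treating bytes 9, 10 and 13 as "format" characters is an assumption about what the file needs.

```csharp
using System.Linq;

public static class ByteReplaceSketch
{
    // Replaces every byte that is neither printable ASCII (32..126)
    // nor tab (9), line feed (10) or carriage return (13).
    // One byte in, one byte out, so the file length stays the same.
    public static byte[] ReplaceNonPrintable(byte[] input, byte replacement)
    {
        return input
            .Select(b => ((b >= 32 && b < 127) || b == 9 || b == 10 || b == 13)
                ? b
                : replacement)
            .ToArray();
    }
}
```

A call such as ByteReplaceSketch.ReplaceNonPrintable(bytes, (byte)'X') would put the letter X in place of each unwanted byte. Like the code above, this is only safe when the input really is a single-byte, ASCII-compatible file.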

Now let's see why your original code doesn't work for non-ASCII files. Text always has an encoding (ASCII, UTF-8, UTF-16 and many others). The first 128 values (0 to 127) are the same in most encodings, so your code may work, but some characters need more than one byte to be encoded. For example, the Italian sentence È sù is encoded in UTF-8 like this:

Bytes     Characters
195 136   È
32        (space)
115       s
195 185   ù
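The table can be reproduced with a few lines of C#. This is a small illustrative sketch; the class and method names are hypothetical, and the accented letters are written as escapes so the result does not depend on the source file's own encoding.

```csharp
using System.Linq;
using System.Text;

public static class Utf8BytesDemo
{
    // Returns the UTF-8 bytes of the text as space-separated decimal values.
    public static string DescribeUtf8Bytes(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return string.Join(" ", bytes.Select(b => b.ToString()));
    }
}
```

Calling Utf8BytesDemo.DescribeUtf8Bytes("\u00C8 s\u00F9") (that is, È sù) yields "195 136 32 115 195 185": six bytes for four characters.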

As you can see, some characters need more than one byte, and the values differ from one encoding to another. Moreover, some text files begin with a BOM (byte order mark) that declares their encoding, and it should not be treated as content. There are techniques to convert, for example, è to e (there are some very good articles here on SO), but unfortunately you can never treat non-ASCII text as plain bytes (not to mention Far East languages). A good approximation is to first read the file as text. If you don't know its encoding, the framework can detect it for you when the file begins with a BOM. Without a BOM this is a serious problem: unless you have solid knowledge about the file content, you can't reliably guess the encoding (see this too):

string content = File.ReadAllText("file.txt");

Now let's get its ASCII representation (non-ASCII characters will be automatically replaced with ?):

byte[] output = Encoding.ASCII.GetBytes(content);

This byte array may still include non-printable characters (outside the [32...126] range), so you may still need to apply the same filter:

byte[] output = Encoding.ASCII.GetBytes(content)
    // Keep printable ASCII (32..126); 127 is DEL, a control character.
    .Select(c => (c >= 32 && c < 127) ? c : (byte)63).ToArray();
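If ? loses too much, one common way to turn è into e is to decompose the text (Unicode normalization form D) and drop the combining marks before converting to ASCII. The sketch below combines that with the filter above and keeps tabs and line breaks. It is an illustration under those assumptions, not necessarily the technique the articles mentioned earlier describe, and the class and method names are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class AsciiFoldingSketch
{
    // Converts decoded text to ASCII bytes:
    //  - accented letters lose their accent (È becomes E),
    //  - tab, CR and LF are kept so the line structure survives,
    //  - anything else outside 32..126 becomes '?'.
    public static byte[] ToPrintableAscii(string content)
    {
        // Form D splits "È" into "E" followed by a combining grave accent.
        string decomposed = content.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the accent, keep the base letter

            bool keep = (c >= 32 && c < 127) || c == '\t' || c == '\r' || c == '\n';
            builder.Append(keep ? c : '?');
        }

        return Encoding.ASCII.GetBytes(builder.ToString());
    }
}
```

With this sketch, È sù becomes E su, while a character with no ASCII base letter, such as €, still becomes ?. Letters that do not decompose (ø or ß, for example) also end up as ?, so a custom replacement table would be needed for those.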

Final notes: this code is not efficient at all. We read the whole file into memory, convert it to an array of bytes (in memory), then create a copy (again in memory) to finally write it back. For files bigger than a few kilobytes, you should read characters directly from the file (one by one) with the proper encoding. And do not forget Unicode surrogates and modifiers.
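A character-by-character version might look like the sketch below. It is a rough illustration rather than a tuned implementation: the names are hypothetical, UTF-8 is only an assumed fallback when the file has no BOM, and keeping tab, CR and LF is an assumption about what the file format needs. A surrogate pair is written as a single ? so that one character in the input gives one character in the output.

```csharp
using System.IO;
using System.Text;

public static class StreamingAsciiSketch
{
    // Copies text one character at a time, replacing anything that is
    // not printable ASCII (32..126), tab, CR or LF with '?'.
    public static void CopyAsAscii(TextReader reader, TextWriter writer)
    {
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            // A surrogate pair is one character stored as two chars.
            if (char.IsHighSurrogate(c)
                && reader.Peek() != -1
                && char.IsLowSurrogate((char)reader.Peek()))
            {
                reader.Read(); // consume the low surrogate
                writer.Write('?');
                continue;
            }

            bool keep = (c >= 32 && c < 127) || c == '\t' || c == '\r' || c == '\n';
            writer.Write(keep ? c : '?');
        }
    }

    // Assumption: UTF-8 unless a BOM says otherwise (third argument = true).
    public static void ConvertFile(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath, Encoding.UTF8, true))
        using (var writer = new StreamWriter(outputPath, false, Encoding.ASCII))
        {
            CopyAsAscii(reader, writer);
        }
    }
}
```

Because CopyAsAscii works on TextReader and TextWriter, it can be tried on in-memory strings with StringReader and StringWriter before pointing ConvertFile at real files. Combining marks are not handled specially here, so a decomposed è would come out as e followed by ?.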

It is give me an error like can not implicitly convert type system.collections.generic.IEnumerable to byte[] — Debopam, Jan 07 '14 at 21:18
@Debopam I did forget ToArray(), I updated with more examples too (about why your code will fail for non ASCII input files). — Adriano Repetti, Jan 07 '14 at 21:28
