Convert ANSI data to Unicode

Question

I have a text file saved in code page 1256.
Because Windows CE 5.0 on my device does not support that code page, I can't open the file with that encoding in .NET CF, but the OS does support Unicode (I displayed some hard-coded strings in my form). How can I read that file and convert its data to Unicode?
How can I convert a single character to its equivalent UTF-8 bytes?
This link says that in the 1256 code page table, character number 200 (0xC8) is 0x0628. What is the relation between them? If I have 200 (0xC8), how can I obtain 0x0628?

Is there any chance you could perform the translation somewhere else (app server perhaps), and just give the device data it can natively handle? Otherwise you're just reimplementing Encoding — Marc Gravell, Oct 30 '11 at 08:11
Regarding this "THIS LINK says that in 1256 code page table the character number 200/C8 is 0x0628. so what's the relation between them? if i have 200/C8 , how can i obtain the 0x0628?": there is no relationship: they are two different systems. You simply have to do what Jon suggests and make a mapping from one to the other by hand. — Adam Cameron, Oct 30 '11 at 08:25
@Jon Skeet, so does it seem that Windows Code Pages are just simple character mapping?! and .NET framework (or Win32 built-in APIs) just do the same process/mapping Jon did?! ok, thanks for your help. i'm gonna implement it. — losingsleeep, Oct 30 '11 at 08:50

Jon Skeet · Answer 1 · 2011-10-30T08:19:07.527

It would probably be easiest just to hard-code the conversion yourself: create a char[] of 256 values, populate the first 128 positions with just the equivalent numbers, and then populate the rest manually. The "relation" between them isn't one you can derive mathematically; it's just a somewhat arbitrary assignment of values.

For example:

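private static readonly char[] CodePage1256 = GenerateCodePage1256();

private static char[] GenerateCodePage1256()
{
    char[] ret = new char[256];
    // Bytes 0x00-0x7F are the same as ASCII, so they map to themselves.
    for (int i = 0; i < 128; i++)
    {
        ret[i] = (char) i;
    }
    // Bytes 0x80-0xFF, in order. The finished string must hold all 128 characters.
    string upperCharacters =
        "\u20ac\u067e\u201a\u0192\u201e\u2026\u2020\u2021" +
        "\u02c6\u2030"; // etc - from the Wikipedia page

    // Bounded by the string length so an incomplete table does not throw;
    // entries not yet filled in stay '\0'.
    for (int i = 0; i < upperCharacters.Length; i++)
    {
        ret[i + 128] = upperCharacters[i];
    }
    return ret;
}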

Then you have a direct byte-to-char mapping. Of course, this is a potentially error-prone process; another possibility would be to create a file containing the mapping on a system which does have that code page.
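The mapping-file idea could be sketched roughly as follows. This is only an illustration, not something from the answer itself: it assumes a desktop machine where Encoding.GetEncoding(1256) succeeds, and the class name MappingFileGenerator, its methods, and the file name cp1256.map are all made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical desktop-side tool: run it once on a machine that has code page 1256.
public static class MappingFileGenerator
{
    // Decodes every byte value 0..255 with the given single-byte encoding.
    public static char[] BuildTable(Encoding encoding)
    {
        byte[] allBytes = new byte[256];
        for (int i = 0; i < allBytes.Length; i++)
        {
            allBytes[i] = (byte) i;
        }
        return encoding.GetChars(allBytes);
    }

    // Writes the table as raw little-endian UTF-16 (2 bytes per entry, no byte order mark).
    public static void WriteTable(char[] table, string path)
    {
        byte[] raw = Encoding.Unicode.GetBytes(table);
        using (FileStream output = File.Create(path))
        {
            output.Write(raw, 0, raw.Length);
        }
    }

    public static void Main()
    {
        // Assumption: code page 1256 is available here. On .NET Core / .NET 5+
        // the CodePagesEncodingProvider would have to be registered first.
        char[] table = BuildTable(Encoding.GetEncoding(1256));
        WriteTable(table, "cp1256.map"); // hypothetical file name
    }
}
```

On the device, the file could then be read back into a byte[] and turned into the 256-entry char[] with Encoding.Unicode.GetChars, since UTF-16 is supported there.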

Anyway, once you've got the mapping, you can easily convert any array of bytes to a string or char array, at which point you can use the normal .NET classes to write the file out as UTF-8. For example:

// Requires: using System.IO;
using (Stream input = File.OpenRead("input.txt"))
{
    // File.CreateText writes UTF-8.
    using (StreamWriter output = File.CreateText("output.txt"))
    {
        byte[] byteBuffer = new byte[8 * 1024];
        char[] charBuffer = new char[byteBuffer.Length];
        int bytesRead;

        while ((bytesRead = input.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            // One byte in code page 1256 is exactly one char, so map each byte directly.
            for (int i = 0; i < bytesRead; i++)
            {
                charBuffer[i] = CodePage1256[byteBuffer[i]];
            }
            output.Write(charBuffer, 0, bytesRead);
        }
    }
}
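To tie this back to the two specific questions (how 0xC8 becomes 0x0628, and how one character becomes UTF-8 bytes), here is a small sketch. The class SingleCharacterExample and its methods are made up for illustration, and the table holds only the one entry quoted in the question rather than the full code page.

```csharp
using System;
using System.Text;

public static class SingleCharacterExample
{
    // Looks a byte up in a 256-entry table such as CodePage1256.
    public static char Decode(char[] table, byte b)
    {
        return table[b];
    }

    // Encodes one UTF-16 char as UTF-8 bytes.
    public static byte[] ToUtf8(char c)
    {
        return Encoding.UTF8.GetBytes(new char[] { c });
    }

    public static void Main()
    {
        char[] table = new char[256];
        table[0xC8] = '\u0628'; // the single entry from the question

        char beh = Decode(table, 0xC8);
        byte[] utf8 = ToUtf8(beh);

        Console.WriteLine(((int) beh).ToString("X4")); // prints 0628
        Console.WriteLine(BitConverter.ToString(utf8)); // prints D8-A8
    }
}
```

The lookup is nothing more than an array index, and the UTF-8 step is handled by Encoding.UTF8, which the question says the device supports.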

so does it seem that Windows Code Pages are just simple character mapping?! and .NET framework (or Win32 built-in APIs) just do the same process/mapping u did?! ok, thanks for your help. i'm gonna implement it. — losingsleeep, Oct 30 '11 at 08:49
@JonSkeet I hadn't ever thought of it... So composable characters aren't re-composed? a + ` isn't "translated" to à if you go from Unicode to Win-1252? — xanatos, Oct 30 '11 at 10:53
@JonSkeet Just tested. No, it doesn't. But the composable ` is changed to a "standard" `. — xanatos, Oct 30 '11 at 12:45
