I have a text file saved in code page 1256 (Windows-1256).
Windows CE 5.0 on my device does not support that code page, so I can't open the file with that encoding in .NET CF. The OS does support Unicode, though (I checked by displaying some hard-coded strings on my form). How can I read that file and convert its data to Unicode?
How can I convert a single character to its equivalent UTF-8 bytes?
THIS LINK says that in the 1256 code page table, character number 200/C8 is 0x0628. What is the relation between them? If I have 200/C8, how can I obtain 0x0628?

Peter Boughton
losingsleeep
  • Is there any chance you could perform the translation somewhere else (app server perhaps), and just give the device data it can natively handle? Otherwise you're just reimplementing Encoding – Marc Gravell Oct 30 '11 at 08:11
  • Regarding this "THIS LINK says that in 1256 code page table the character number 200/C8 is 0x0628. so what's the relation between them? if i have 200/C8 , how can i obtain the 0x0628?": there is no relationship: they are two different systems. You simply have to do what Jon suggests and make a mapping from one to the other by hand. – Adam Cameron Oct 30 '11 at 08:25
  • @Jon Skeet, so it seems that Windows code pages are just simple character mappings?! And the .NET Framework (or the built-in Win32 APIs) just does the same mapping Jon did?! OK, thanks for your help. I'm going to implement it. – losingsleeep Oct 30 '11 at 08:50

1 Answer

It would probably be easiest just to hard-code the conversion yourself: create a char[] of 256 values, populate the first 128 positions with their own index values (code page 1256 matches ASCII in that range), and then populate the rest manually. The "relation" between them isn't one you can derive mathematically; it's just a somewhat arbitrary assignment of values.

For example:

private static readonly char[] CodePage1256 = GenerateCodePage1256();

private static char[] GenerateCodePage1256()
{
    char[] ret = new char[256];
    // Bytes 0x00-0x7F are the same as ASCII, so they map to themselves.
    for (int i = 0; i < 128; i++)
    {
        ret[i] = (char) i;
    }
    // Bytes 0x80-0xFF, in order. The finished string must contain
    // exactly 128 characters, or the loop below will throw.
    string upperCharacters =
        "\u20ac\u067e\u201a\u0192\u201e\u2026\u2020\u2021" +
        "\u02c6\u2030"; // etc - from the Wikipedia page

    for (int i = 0; i < 128; i++)
    {
        ret[i + 128] = upperCharacters[i];
    }
    return ret;
}

Then you have a direct byte to char mapping. Of course this is a potentially error-prone process - another possibility would be to create a file with the mapping in, on a system which does have that code page.
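
The second option, generating the mapping on a machine that does have code page 1256, could be sketched roughly like this. It is meant to run on a desktop .NET installation rather than on the device, and the names `CodePageTableGenerator` and `BuildUpperHalfLiteral` are made up for illustration:

```csharp
using System;
using System.Text;

public static class CodePageTableGenerator
{
    // Decodes bytes 0x80..0xFF with the given single-byte encoding and
    // returns them as a C# string literal made of \uXXXX escapes.
    public static string BuildUpperHalfLiteral(Encoding encoding)
    {
        byte[] upperBytes = new byte[128];
        for (int i = 0; i < 128; i++)
        {
            upperBytes[i] = (byte) (i + 128);
        }

        // For a single-byte code page this yields one char per byte.
        char[] chars = encoding.GetChars(upperBytes);

        StringBuilder sb = new StringBuilder("\"");
        foreach (char c in chars)
        {
            sb.Append("\\u").Append(((int) c).ToString("x4"));
        }
        sb.Append('"');
        return sb.ToString();
    }
}
```

Calling `CodePageTableGenerator.BuildUpperHalfLiteral(Encoding.GetEncoding(1256))` on such a machine and pasting the result into `upperCharacters` would avoid typing the table by hand. On .NET Core and later, the code-pages encoding provider has to be registered first (`Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`) before `GetEncoding(1256)` succeeds.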

Anyway, once you've got the mapping, you can easily convert any array of bytes to a string or char array, at which point you can use the normal .NET classes to write out the file as UTF-8 again. For example:

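// File.Open needs a FileMode argument; File.OpenRead opens the file read-only.
using (Stream input = File.OpenRead("input.txt"))
{
    // File.CreateText returns a StreamWriter that writes UTF-8.
    using (StreamWriter output = File.CreateText("output.txt"))
    {
        byte[] byteBuffer = new byte[8 * 1024];
        char[] charBuffer = new char[byteBuffer.Length];
        int bytesRead;

        while ((bytesRead = input.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            // Code page 1256 is single-byte, so each byte maps to exactly one char.
            for (int i = 0; i < bytesRead; i++)
            {
                charBuffer[i] = CodePage1256[byteBuffer[i]];
            }
            output.Write(charBuffer, 0, bytesRead);
        }
    }
}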
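
The question also asks how to get the UTF-8 bytes for a single character. Once a byte has been mapped to a `char` (for example 0xC8 to U+0628), the built-in UTF-8 encoding can do that part. A minimal sketch, with `Utf8Example` and `ToUtf8` as illustrative names only:

```csharp
using System.Text;

public static class Utf8Example
{
    // Returns the UTF-8 encoding of one UTF-16 code unit.
    // Note: a lone surrogate is not a complete character, so the
    // encoder substitutes a replacement character for it.
    public static byte[] ToUtf8(char c)
    {
        return Encoding.UTF8.GetBytes(new char[] { c });
    }
}
```

For U+0628 this gives the two bytes 0xD8 0xA8, which is what the `StreamWriter` above writes for each byte 0xC8 in the input.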
Jon Skeet
  • So it seems that Windows code pages are just simple character mappings?! And the .NET Framework (or the built-in Win32 APIs) just does the same mapping you did?! OK, thanks for your help. I'm going to implement it. – losingsleeep Oct 30 '11 at 08:49
  • @losingsleeep: Yes, that's exactly it. – Jon Skeet Oct 30 '11 at 09:04
  • @JonSkeet I hadn't ever thought of it... So composable characters aren't re-composed? a + ` isn't "translated" to à if you go from Unicode to Win-1252? – xanatos Oct 30 '11 at 10:53
  • @xanatos: I wouldn't expect an `Encoding` to do that, no. – Jon Skeet Oct 30 '11 at 12:15
  • @JonSkeet Just tested. No, it doesn't. But the composable ` is changed to a "standard" `. – xanatos Oct 30 '11 at 12:45