
OK, before you jump at me with spears and take me away to the burning battlefield of code pages, please note that I am not trying to auto-detect the code page of a text. I know that's not possible. What I don't know is whether it's possible to automatically detect a code-page problem. Take the following example: I have a largish text (2-3 pages) plus a "default" code page. I try to decode the text with the default code page; if I get gibberish, I try another code page. So the question is: is it possible to somehow detect gibberish characters?

Thanks for your kind help in advance. Best Regards, Daniel

    hmm, what...? Can't understand the context. – Otiel Aug 09 '11 at 12:45
  • Strictly speaking you're not decoding the text when you apply a code-page. At a basic level you're choosing a character set to display the text in. There's a one-to-one mapping between the data and the displayed characters. Therefore, you cannot detect the gibberish character as the data is the same. – Tim Lloyd Aug 09 '11 at 12:49
  • 1
    It' not all Chinese to everyone (pun intended). How would you define "gibberish" in a general way? Having that said, depending on your environment, specification and/or requirements you might get away with something like "unprintable" or "control" characters. YMMV. – Christian.K Aug 09 '11 at 12:49
  • Write an Artificial Intelligence based program that is capable of detecting gibberish. It needs to reverse engineer the language of the text first, the harder problem given that there are about 3000 of them. Then it is just a matter of basic grammar rules and dictionary lookup. Easy. – Hans Passant Aug 09 '11 at 13:04
  • Defining gibberish in a general way? I do not know about, that's why I asked. It might not be possible at all. I don't try to detect the language as in my case it could be many different languages. – Daniel Aug 09 '11 at 13:05

1 Answer


I reckon that the only practical way is to manually define some kind of 'mask' for each code page; a structure that defines all of the character values that you consider valid for each of your code pages.

Then, you could check if the page contained any character values that weren't contained in this mask.
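To make the mask idea concrete, here is a minimal sketch. The `masks` dictionary and its contents are hypothetical; you'd populate a `bool[256]` per code page by hand (or with the generation code later in this answer):

```csharp
using System.Collections.Generic;

public static class CodePageMask
{
    // Hypothetical store: for each code page, a 256-entry table where
    // masks[codePage][b] is true if byte value b counts as valid text.
    static readonly Dictionary<int, bool[]> masks = new Dictionary<int, bool[]>();

    public static bool LooksValid(byte[] data, int codePage)
    {
        bool[] mask;
        if (!masks.TryGetValue(codePage, out mask))
            return true; // no mask defined for this code page; assume OK

        foreach (byte b in data)
        {
            if (!mask[b])
                return false; // byte outside the mask: likely gibberish
        }
        return true;
    }
}
```

With masks in place, detecting a code-page problem reduces to a single pass over the raw bytes.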

Building the mask would involve a fair bit of manual effort. Create a page with every character, then display it using the appropriate code page and then look to see which aren't rendered 'nicely'. It's a one-off activity for each code page, so perhaps worth the effort.

Of course, if there was a way to parse a code page, you could generate this mask automatically... Hmm... Back in a bit.

Try this code fragment. It tests the characters 32-255 against each known code page.

        // Requires: using System.Text;
        StringBuilder source = new StringBuilder();

        // Build a test string containing the characters 32..255.
        for (int ix = 0; ix < 224; ix++)
        {
            source.Append((char)(ix + 32));
        }

        EncodingInfo[] encs = Encoding.GetEncodings();

        foreach (var encInfo in encs)
        {
            System.Console.WriteLine(encInfo.DisplayName);
            Encoding enc = Encoding.GetEncoding(encInfo.CodePage);

            byte[] result = enc.GetBytes(source.ToString().ToCharArray());

            // Note: indexing result[ix] assumes a single-byte code page;
            // DBCS pages emit more than one byte per character.
            for (int ix = 0; ix < 224; ix++)
            {
                // '?' (63) is the default fallback for unmappable characters.
                if (result[ix] == 63 && source[ix] != 63)
                {
                    // Code page translated this character to '?';
                    // print its numeric value.
                    System.Console.Write("{0:d} ", (int)source[ix]);
                }
            }
            System.Console.WriteLine();
        }

I was looking around in the debugger and noticed that '?' is used as a fall-back character if the source character is not included in the code page. By checking for '?' (and ensuring that it wasn't '?' to start with), the code assumes that the code page couldn't handle it.
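If relying on '?' feels fragile (the text may legitimately contain question marks), the fallback can be made explicit. This sketch passes an `EncoderExceptionFallback` to `Encoding.GetEncoding`, so an unmappable character throws instead of being silently replaced:

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    public static bool CanEncode(string text, int codePage)
    {
        // Clone the encoding with exception fallbacks instead of the
        // default replacement ('?') fallback.
        Encoding enc = Encoding.GetEncoding(
            codePage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            enc.GetBytes(text);
            return true;  // every character maps into the code page
        }
        catch (EncoderFallbackException)
        {
            return false; // at least one character has no mapping
        }
    }
}
```

This removes the ambiguity entirely: a thrown `EncoderFallbackException` can only mean an unmappable character, never a literal '?'.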

DBCS code pages may need a bit more attention; I haven't looked into them. But try this as a starting point.

I'd use code like this to build an initial 'mask', as I described earlier, and then manually adjust that mask based on what looked good and what didn't.
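Putting it together for the OP's scenario (one default code page plus one fallback), the decision could be as simple as counting out-of-mask bytes under each candidate and picking the lower score. A sketch; `IsInMask` stands in for the manually tuned mask lookup described above, and its body here is only a placeholder:

```csharp
using System.Linq;

public static class CodePageChooser
{
    // Hypothetical mask lookup: true if byte b is considered valid text
    // under the given code page. Replace with the real mask table.
    public static bool IsInMask(byte b, int codePage)
    {
        return b >= 32; // placeholder: treat control bytes as suspicious
    }

    public static int Choose(byte[] data, int defaultCodePage, int fallbackCodePage)
    {
        int defaultMisses  = data.Count(b => !IsInMask(b, defaultCodePage));
        int fallbackMisses = data.Count(b => !IsInMask(b, fallbackCodePage));

        // Prefer the default code page on a tie.
        return defaultMisses <= fallbackMisses ? defaultCodePage : fallbackCodePage;
    }
}
```

Scoring rather than a hard pass/fail also copes with the occasional stray byte in an otherwise clean text.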

Steve Morgan
  • You're awesome. Thanks, I'll give it a shot. This is EXACTLY the approach I was looking for. – Daniel Aug 09 '11 at 13:30
  • Very interesting answer, but aren't there going to be lots of situations though where multiple code pages have all the required characters for the same piece of text? There wouldn't be any fall-back characters in this instance. – Tim Lloyd Aug 09 '11 at 13:32
  • @chibacity If there aren't any fall-back characters, then presumably there's no problem. It's the presence of fall-back characters that imply that the code page can't handle the text (the 'gibberish' to which the OP referred). Also, the OP reckons only a small number of code pages are required. But as I suggested, I wouldn't use this approach as-is, but would use it to pre-populate a 'mask' that would give a finer degree of control. I don't think the OP is expecting to accurately derive the 'best' code page, though, just try to find one that looks reasonable. – Steve Morgan Aug 09 '11 at 14:17
  • @Steve The point that I was pondering is that multiple code-pages may have all the "code-points" for a given piece of text, and that this situation might not be uncommon. However, they could display the same text in completely different language characters, therefore, there would be no fallback characters, but to the user it would be gibberish. I do like your answer! – Tim Lloyd Aug 09 '11 at 14:28
  • @chibacity Yes, you're quite right. This mechanism defines gibberish as being absent from the code page and only applies at a character level. A somewhat simplistic approach, but hopefully good enough in this case. I like the fact that you like my answer :-D – Steve Morgan Aug 09 '11 at 14:40
  • This will work perfectly for me as there are only two code pages to use: the default and the fallback (if the fallback can't handle the text, then that was it). I'm also not shooting for a perfect solution at the moment. I'll be glad if the algorithm can solve 50% of the issues. – Daniel Aug 09 '11 at 14:52