
I don't know what the original encoding is, so I assume it is either IBM850 or ISO8859-1. My process is as follows:

  1. IBM850 -> UTF8
    If this succeeds, I take the original encoding to be IBM850; if it fails, I go on to the next step:

  2. ISO8859-1 -> UTF8
    If this succeeds, I take the original encoding to be ISO8859-1.

But there is a problem: if the original encoding is ISO8859-1, it is recognised as IBM850, and if the original encoding is IBM850, it is recognised as ISO8859-1.

It seems that IBM850 and ISO8859-1 have something in common.

Can anyone help? Thanks.

Skyline

1 Answer


Yes, only the most trivial kind of autodetection is possible by testing whether conversion fails or succeeds. It's not going to work for input encodings where (almost) any input is valid.
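To illustrate why the success-or-failure test cannot tell these two encodings apart, here is a minimal Python sketch. It assumes Python's built-in `cp850` and `latin_1` codecs as stand-ins for IBM850 and ISO8859-1; the original question does not say which conversion tool is used.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Neither call raises: both codecs define a character for all 256 byte values,
# so "did the conversion succeed?" carries no information here.
as_cp850 = all_bytes.decode("cp850")
as_latin1 = all_bytes.decode("latin_1")

# The same byte simply becomes a different character in each encoding.
sample = b"\xe9"
print(sample.decode("latin_1"))  # é
print(sample.decode("cp850"))    # Ú
```

Because the first conversion tried always succeeds, a try-one-then-the-other procedure will always report whichever encoding it tests first.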

You need to know something more about your likely output, so that you can test whether it makes more sense after translating from IBM850 or from ISO8859-1. That's what enca and libenca do. You can probably start with some simple expectations to check:

  1. Does your source happen to be within the ASCII subset of both encodings? Then you're happy with any conversion (but you have no way to know the original encoding at all).
  2. Does your code use box drawing characters? If it does not, you can reject IBM850 whenever decoding as IBM850 would produce them.
  3. Does your code use control characters from ISO8859-1? If it does not, you can reject ISO8859-1 whenever bytes in the range 0x80-0x9F appear.
  4. Do the fragments of your code which are non-ASCII always represent a text in a natural language? Then you can use frequency tables for characters and their pairs, selecting the source encoding which makes the result closer to your natural language(s) on these criteria. (If both variants are almost equally acceptable, it's probably better to give an error message and leave the final decision to humans).
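The four checks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how enca works: `guess_encoding` and `EXPECTED` are hypothetical names, the text is assumed to contain neither box drawing characters nor C1 control characters, and counting expected accented letters is a crude stand-in for real frequency tables.

```python
# Hypothetical: non-ASCII letters we expect in the text (lowercase French here).
EXPECTED = set("àâçéèêëîïôùûü")

def guess_encoding(data: bytes) -> str:
    # Check 1: pure ASCII decodes identically either way.
    if all(b < 0x80 for b in data):
        return "ascii"

    candidates = {"cp850", "latin_1"}

    # Check 3: bytes 0x80-0x9F are control characters in ISO8859-1;
    # assuming the text has none, their presence rules ISO8859-1 out.
    if any(0x80 <= b <= 0x9F for b in data):
        candidates.discard("latin_1")

    # Check 2: assuming the text has no box drawing or block characters
    # (U+2500 to U+259F), their appearance rules IBM850 out.
    if any("\u2500" <= ch <= "\u259f" for ch in data.decode("cp850")):
        candidates.discard("cp850")

    if not candidates:
        return "unknown"
    if len(candidates) == 1:
        return candidates.pop()

    # Check 4: score each decoding by how many expected letters it yields.
    scores = {enc: sum(ch in EXPECTED for ch in data.decode(enc))
              for enc in candidates}
    best = max(scores.values())
    winners = [enc for enc, score in scores.items() if score == best]
    # A tie means the decision should be left to a human.
    return winners[0] if len(winners) == 1 else "ambiguous"
```

For example, `guess_encoding("café".encode("cp850"))` takes the 0x80-0x9F route, because IBM850 encodes é as byte 0x82, while `guess_encoding("déjà vu".encode("latin_1"))` is settled by the letter score.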
Anton Kovalenko