Today I received a file from a customer that I have to read, but it contains strange characters. Using known names, I can guess the meaning of some characters.

For example:

Real name | Encoded as   | Char  | Hex
----------|--------------|-------|-------
Françios  | Fran?ºios    | ç     | 3f ba
André     | Andr??       | é     | 3f 3f
Hélène    | H??l?¿ne     | è     | 3f bf
etc.
  • I have tried every code page known to .NET to import the file and checked whether the result contains the words I know, but no code page gives a satisfactory result.
  • Notepad++ detects the file as ANSI and also shows the unwanted characters. (But it has a hex-editor plugin that is useful.)
  • Other files (from the same user & zipfile) are encoded in UTF-8.

From the guy I received the files from, I cannot expect help. (Using Google Translate) he made it clear to me that he found it very hard just to create the files, and he is using software (I believe SAP) that I do not have access to.

Is there any other way I can find the encoding of the files he just sent me?

GvS
  • What does Notepad++ say the file is? Look in the bottom-right corner: UNICODE, ANSI, UTF-8, and which character set? – balexandre Mar 11 '11 at 14:14
  • Notepad++ thinks it's `ANSI`. But ANSI does not contain characters above 7F (I was told). ba & bf certainly are larger. – GvS Mar 11 '11 at 14:16
  • You need to request that file again, in UTF-8 or UNICODE. You say that he uses some software, so I'm sure he has an option for this somewhere... – balexandre Mar 11 '11 at 14:21
  • Where is he from? He probably (unknowingly) defaulted to his codepage. – xanatos Mar 11 '11 at 14:25
  • @balexandre: I'm trying that right now. (Too bad he only speaks French and thinks Unicode is some kind of unicorn.) But I also want to know how he managed to get this strange encoding. – GvS Mar 11 '11 at 14:27
  • I suspect you'll just have to identify all the special cases and manually search-replace them. Almost all (if not all) code pages keep the ASCII section 0-7F intact, so I can't imagine any would knowingly encode accents as question-mark sequences. – Rup Mar 11 '11 at 14:28

2 Answers

I can get those results if I take UTF-8 encoded text, pretend it is CP850, and then convert it to Latin-1, Windows-1252, or a similar encoding. The "?" comes from the fact that the CP850 character at 0xc3 is "├", which doesn't exist in Latin-1 or derived encodings, so the conversion replaces it with a "?".
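The chain described above can be reproduced in a few lines of Python. This is only a sketch of the hypothesis, not a claim about what the customer's software really does; the function name `simulate_damage` and its parameters are made up for illustration.

```python
def simulate_damage(text, misread_as="cp850", written_as="latin-1"):
    """Encode as UTF-8, misread the bytes as a DOS code page,
    then write the result in a narrower encoding (hypothetical helper)."""
    misread = text.encode("utf-8").decode(misread_as)
    # Characters missing from the target encoding become "?" (0x3f).
    return misread.encode(written_as, errors="replace")

for ch in "çéè":
    print(ch, simulate_damage(ch).hex(" "))
# ç 3f ba
# é 3f ae
# è 3f bf
```

With CP850 the bytes for "ç" and "è" match the table in the question, while "é" comes out as `3f ae` rather than the observed `3f 3f`, so CP850 is close but not exact.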


Edit: I did a wider search using iconv, and CP437, CP862, or CP865 are better matches than CP850. Since you asked, the one-liner I used this time was:

for enc in `iconv -l`; do echo -n "$enc: "; echo -n "ç é è" | iconv -s -f $enc -t "LATIN1//TRANSLIT" 2>/dev/null; echo; done
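For readers without iconv, roughly the same sweep can be sketched in Python. It is not the script used for this answer: the candidate list is a small hand-picked assumption, and instead of printing every result it compares against the hex bytes reported in the question.

```python
# Hex bytes reported in the question for each accented character.
OBSERVED = {"ç": "3f ba", "é": "3f 3f", "è": "3f bf"}

# Assumed shortlist of code pages; extend it as needed.
CANDIDATES = ["cp437", "cp850", "cp852", "cp858", "cp860",
              "cp862", "cp863", "cp865", "cp1252", "latin-1"]

def garbled_hex(ch, codepage):
    """UTF-8 bytes misread as `codepage`, then written as Latin-1."""
    misread = ch.encode("utf-8").decode(codepage, errors="replace")
    return misread.encode("latin-1", errors="replace").hex(" ")

matches = [cp for cp in CANDIDATES
           if all(garbled_hex(ch, cp) == hx for ch, hx in OBSERVED.items())]
print(matches)
```

A code page appears in `matches` only if it reproduces all three observed byte pairs, which is a stricter check than eyeballing the iconv output.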
Anomie
  • How did you do this conversion? Written a small app or using some software? – GvS Mar 11 '11 at 14:40
  • I threw together a quick PHP script that ran `mb_convert_encoding` on "ç", converting from every encoding listed by `mb_list_encodings` to UTF-8, to see which one gives a result involving "º". That pointed me to CP850, and then I figured the "?" would probably come from a conversion to Latin-1 or another encoding more limited than Unicode. CP850 to Latin-1 is not a perfect answer, though: it gives "é" as "?®" instead of "??". – Anomie Mar 11 '11 at 16:57
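If the file cannot be re-exported, a candidate code page can also be used to generate a search-replace table. The sketch below assumes the damage really is UTF-8 read as CP437 and written as Latin-1, which is only a hypothesis; `build_repair_table` and the list of French letters are illustrative choices.

```python
from collections import defaultdict

def build_repair_table(chars, misread_as="cp437"):
    """Map each damaged byte sequence to the characters that produce it."""
    table = defaultdict(list)
    for ch in chars:
        misread = ch.encode("utf-8").decode(misread_as)
        damaged = misread.encode("latin-1", errors="replace")
        table[damaged].append(ch)
    return dict(table)

table = build_repair_table("àâçèéêëîïôùûü")
for damaged, originals in sorted(table.items()):
    status = "safe" if len(originals) == 1 else "ambiguous"
    print(damaged.hex(" "), originals, status)
```

Sequences that map to a single character can be replaced mechanically. Under this assumption `3f 3f` stands for several different letters, so names such as "Andr??" still need a manual check against known spellings.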
It should be UTF-8 or UTF-16; they cover almost all regular characters. It looks like you have a decode/encode problem.

Notepad++ may be confused because your files do not use a byte-order mark.

How do you process your files?

Try to read them as binary and then try different encodings to get a string. If you do not read them as binary, a default encoding may be applied.

The "?" is a sign of that.

Maybe that helps.
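Reading the raw bytes and trying several decodings could look like the following Python sketch. The helper name `matching_encodings`, the candidate list, and the in-memory sample are assumptions for illustration; with a real file the bytes would come from `open(path, "rb").read()`.

```python
def matching_encodings(raw, known_words, candidates):
    """Return the encodings under which every known word appears in the text."""
    hits = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # bytes are invalid in this encoding, or codec unknown
        if all(word in text for word in known_words):
            hits.append(enc)
    return hits

# Stand-in for the file contents; no default encoding is involved.
sample = "André;Hélène".encode("utf-8")
print(matching_encodings(sample, ["André", "Hélène"],
                         ["ascii", "cp1252", "cp850", "utf-8", "utf-16"]))
# ['utf-8']
```

If the bytes on disk already contain a literal 0x3f where a letter should be, the information was lost before the file was written, and no choice of decoding will restore it.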

mo.