0

I have a legacy application written in Delphi which uses a Jet Database as its back-end for storing data and I need to export the data to a new format.

Opening the database with MS Access (Windows) or MDBViewer (Linux), in fields of type "MEMO" (mysql's TEXT equivalent) all I can see is garbage which resembles Asian characters. Running the legacy application the fields' contents show up correctly.

Is there a way I can try every possible character encoding and convert it to recover the data (I'm comfortable with PHP and C#)? I read something about BOM (byte-order marker), that might be related, any ideas?

Thanks!

Tony Toews
  • 7,850
  • 1
  • 22
  • 27
cmpscabral
  • 183
  • 3
  • 13

1 Answers1

2

Current MS Access versions use UTF-8 to store string values. Older ones simply followed the code page of the machine on which the text was entered.

Most encodings do indeed use some marker bytes to indicate the encoding of what follows. Whether or not you have the benefit of that, really depends on the legacy app. If that simply followed a single encoding, or relied on the machine's code page, then you'll have to do some clever recognizing yourself.

Quick checks

UTF-8

If there is a marker, it would be $EFBBBF. If there isn't, you can make an educted guess that it is UTF-8 when sequences of ASCII (0-127) characters can be seen in the string.

UTF-16

Comes in two flavours: Little Endian (LE) and Big Endian (BE). For characters within the Basic Multilingual Plane, both use two bytes per character. The difference between the two is that for ASCII characters, one starts with a zero byte, the other ends with it.

If there is a marker UTF-16LE is designated by $FFFE and UTF-16BE by $FEFF. If neither of those markers is present having alternating zero and non-zero bytes in the memo field is a fair indication. And your first bet should be UTF-16LE as that is the windows standard and UTF-16BE is used a lot less. (Sorry, can never remember which of the two starts with a zero-byte for ASCII characters and which one starts with a non-zero byte).

Other

If you can exclude UTF-8 and UTF-16, you could try to figure out whether one of the other UTF encodings was used. I wouldn't spend the time though, chances are that the program simply relied on the machine's code page. Seeing as your are dealing with a lot of "asian looking" characters, your best bet would be to check for the MBCS code pages (Multi Byte Character S??? code pages). See MSDN for more details. As I have never dealt with them myself, I'm afraid I can't be of more help here though.

Trying encodings

If you do have to start trying out every encoding there is, you may want to have a look at the DIConvertors library. It's pretty good at converting between encodings. IIRC it can also recognize encodings, but otherwise it should help getting you started with your own detection. It can be found at http://www.yunqa.de/delphi/doku.php/products/converters/index

Marjan Venema
  • 19,136
  • 6
  • 65
  • 79
  • thanks for your reply. All I can see, regardless of the field's contents is this: 瑲ㅦ慜獮屩敤晦笰晜湯瑴汢屻て晜楮屬捦慨獲瑥‰牔扥捵敨⁴卍紻屻ㅦ晜楮牔扥捵敨⁴卍 My bet will be the differences between the machine's code page that's causing this problems. – cmpscabral Jan 17 '11 at 14:24
  • 1
    @cmpscabral - That's ANSI text incorrectly regarded as UTF-16LE. And the content is part of some 'rtf' header: `rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Trebuchet MS;}{\f1\fnil Trebuchet MS`. Save an empty unicode text file with Notepad, paste your weird string and examine the contents with a hex editor, and you'll see.. And +1 to the answer.. – Sertac Akyuz Jan 17 '11 at 16:09
  • Sertac, thanks for your reply. How did you manage to convert it? Can you give an example? I can see the hexadecimal codes, but what can I do with them? – cmpscabral Jan 17 '11 at 16:52
  • @cmpscabral - I didn't convert it, I just copied the text from the text panel from the hex editor, every hex editor should have one such panel. But that's also the point, you shouldn't convert it since the text is not unicode to begin with: Remove the BOM if there's one (0xFFFE as in the answer), and then cast the remaining to a variable with type of AnsiString. – Sertac Akyuz Jan 17 '11 at 17:33
  • @Sertac, @cmpscabral: good pick up Sertac, thanks. It's what I thought when I saw cmpscabral's example string, but didn't have time to check and confirm it at that point... – Marjan Venema Jan 17 '11 at 20:17
  • 1
    A wrinkle for all of this is that from Jet 4 on (the first Jet version to support Unicode), there's also the Unicode compression property of text and memo fields. That's on by default, so it might have an effect on how the information is stored. But I'm rather puzzled as to why Access itself can't properly decode and display the information -- something doesn't see quite right there. – David-W-Fenton Jan 18 '11 at 00:27
  • using mb_convert_encoding([my_string],'UTF-16LE','UTF-8') in PHP works like a charm :) thanks everyone! – cmpscabral Jan 18 '11 at 16:03