
I am writing a program in Delphi 7 that is supposed to encode a Unicode string as an HTML entity string. For example, "ABCģķī" should result in "ABC&#291;&#311;&#299;".

Now 2 basic things:

  1. Delphi 7 is non-Unicode, so I can't just write Unicode characters directly in code to encode them.
  2. Codepages consist of 256 entries, each holding a character specific to that codepage, except the first 128, which are the same for all codepages.

So, how do I get the value of a char that is in the 1-255 range?

I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin Unicode characters.

Is there any other method for returning char value? Any help appreciated

user3060709
  • In Delphi 7, `ord(c)` where `c` is of type `char` is in the range `0..255`. So clearly you have something other than `char`. What is it that you have? Do you have any code? – David Heffernan Dec 03 '13 at 09:52
  • Where do the chars you want to encode come from, and which format are they in? As David mentioned, it would be very helpful if you'd post the code with the missing proper conversion as a starting point for discussion. – DNR Dec 03 '13 at 10:40
  • You need a `WideChar` and `WideString` types instead. Typecast `Word` sized value to WideChar and vice versa. – Free Consulting Dec 03 '13 at 11:32
  • @FreeConsulting Given that `ord(somechar)` can return values >255 it would seem that we already have `WideChar`, but it's not much fun guessing – David Heffernan Dec 03 '13 at 11:58
  • Just back from lunch break :) @DavidHeffernan I use `WideString` as input, get value from `WideChar` and `String` as output. @Heina `WideString` for the conversion comes from a file. – user3060709 Dec 03 '13 at 12:10
  • Well, clearly you will get values past 255, your string is UTF-16 encoded. Do you understand the difference between the 16 bit UTF-16 encoding and 8 bit ANSI? Why don't you use UTF-8 here? That would make a whole lot more sense. You must not convert to 8-bit ANSI since that will result in you losing your data. You don't want that. If you want to convert to numbered entities, then there are many libraries in existence that do that. Why do you feel compelled to re-invent this wheel? Odds are that when you do so, it won't be as round as the tried and tested ones. – David Heffernan Dec 03 '13 at 12:16
  • @DavidHeffernan, SGML entities encoded text will FIT into 7bit ASCII. – Free Consulting Dec 03 '13 at 12:41
  • @FreeConsulting Yes that is true. I don't see the relevance to the matter at hand. – David Heffernan Dec 03 '13 at 12:46

3 Answers

I suggest you avoid codepages like the plague.

There are two approaches for Unicode that I'd consider: WideString, and UTF-8.

WideStrings have the advantage of being 'native' to Windows, which helps if you need to use Windows API calls. The disadvantages are storage space and that, like UTF-8, they can require multiple code units (WideChars) to encode the full Unicode space.

UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.

@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!) and this compactness is highly important to you.
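To make the size trade-off concrete, here is a minimal sketch that compares the storage needed for the question's sample text in both encodings. It assumes the `UTF8Encode` function and `UTF8String` type from the System unit (available since Delphi 6); the program name and variable names are only illustrative.

```pascal
program Utf8SizeDemo;
{$APPTYPE CONSOLE}

var
  W: WideString;
  U: UTF8String;
begin
  // Build "ABC" plus U+0123, U+0137 and U+012B without putting
  // non-ASCII characters into the source file.
  W := 'ABC';
  W := W + WideChar($0123) + WideChar($0137) + WideChar($012B);

  // UTF-8 needs 1 byte per ASCII character and 2 bytes for each of
  // these three Baltic characters.
  U := UTF8Encode(W);

  WriteLn('UTF-16 bytes: ', Length(W) * SizeOf(WideChar)); // 12
  WriteLn('UTF-8 bytes:  ', Length(U));                     // 9
end.
```

For this mostly Latin text UTF-8 is the smaller of the two; for text dominated by characters above U+07FF the balance shifts toward UTF-16.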

Roddy
  • Some mis-information here. For some text, UTF-16 is more compact than UTF-8. I'm thinking of Chinese for instance. And you are incorrect in your statement that `WideString` does not cover the full Unicode space. The `WideString` type wraps the COM `BSTR` which is encoded using `UTF-16` which is a full encoding of Unicode. It is a variable length encoding just like UTF-8. – David Heffernan Dec 03 '13 at 09:56
  • @DavidHeffernan I will not be dealing with Chinese :) As a matter of fact, my program should be able to encode to/from 2 codepages - 1257 and 1251 (Baltic & Cyrillic). – user3060709 Dec 03 '13 at 12:22
  • @Roddy Compactness doesn't play any significant role. (As long as it works, it's fine) – user3060709 Dec 03 '13 at 12:22
  • @user3060709 So, which ANSI encoding supports both of those code pages at once? Why are you even thinking about ANSI? What have you got against Unicode? – David Heffernan Dec 03 '13 at 12:28

In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, it does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" instead. Most other charsets do not support those particular Unicode characters.

In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x0123;&#x0137;&#x012B;".

So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:

  1. if you need HTML 4:

    A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

    B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

  2. If you need HTML 5:

    A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.

    B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
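The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `WideToCodePage` and `EncodeCodePoints` are hypothetical helper names, the surrogate pairs are combined by hand instead of going through `WideStringToUCS4String()`, and the characters `&`, `<`, `>` and `"` are treated as the reserved ASCII values. `WideCharToMultiByte` is the Win32 function declared in the `Windows` unit.

```pascal
program EntityDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper for step 1A: converts UTF-16 text to a given
// code page (for example 1257 or 1251). Characters that the code page
// cannot represent are replaced by the system default character.
function WideToCodePage(const S: WideString; CodePage: Cardinal): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: ask how many bytes the converted text needs.
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    nil, 0, nil, nil);
  if Len <= 0 then
    Exit;
  SetLength(Result, Len);
  // Second call: convert into the allocated buffer.
  WideCharToMultiByte(CodePage, 0, PWideChar(S), Length(S),
    PAnsiChar(Result), Len, nil, nil);
end;

// Hypothetical helper for step 2: emits decimal references that carry
// the Unicode code point, combining surrogate pairs into one value.
function EncodeCodePoints(const S: WideString): string;
var
  I, CodePoint: Integer;
  W, W2: Word;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    W := Ord(S[I]);
    CodePoint := W;
    // A high surrogate followed by a low surrogate forms one code point.
    if (W >= $D800) and (W <= $DBFF) and (I < Length(S)) then
    begin
      W2 := Ord(S[I + 1]);
      if (W2 >= $DC00) and (W2 <= $DFFF) then
      begin
        CodePoint := $10000 + ((W - $D800) shl 10) + (W2 - $DC00);
        Inc(I); // skip the low surrogate
      end;
    end;
    // Plain ASCII is copied as-is, except the assumed reserved characters.
    if (CodePoint < 128) and
       not (AnsiChar(CodePoint) in ['&', '<', '>', '"']) then
      Result := Result + AnsiChar(CodePoint)
    else
      Result := Result + '&#' + IntToStr(CodePoint) + ';';
    Inc(I);
  end;
end;

var
  W: WideString;
  A: AnsiString;
  I: Integer;
begin
  W := 'ABC';
  W := W + WideChar($0123) + WideChar($0137) + WideChar($012B);

  WriteLn(EncodeCodePoints(W)); // ABC&#291;&#311;&#299;

  // Every byte of the converted string is in the range 0..255.
  A := WideToCodePage(W, 1257);
  for I := 1 to Length(A) do
    Write(Ord(A[I]), ' ');
  WriteLn;
end.
```

For hex notation, `IntToHex(CodePoint, 4)` with an `&#x` prefix could replace the `IntToStr` call. The byte values printed by the second loop depend on the code page passed in, which is why no expected output is shown for it.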

Remy Lebeau

In case I understood the OP correctly, I'll just leave this here.

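// Encodes every character outside 7-bit ASCII as a decimal numeric
// character reference, so the result fits into plain ASCII.
// Requires SysUtils (for IntToStr) in the uses clause.
// Surrogate pairs are not combined and &, < and > are not escaped.
function Entities(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    // Values 128..255 are encoded too: appending such a WideChar to an
    // AnsiString would go through the system code page and may lose data.
    if Word(S[I]) > 127 then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'
    else
      Result := Result + S[I];
  end;
end;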
Free Consulting