I'm trying to figure out the best way of encoding text (either 8-bit ubyte[] or string) to its HTML counterpart.

My proposal so far is to use a lookup table that maps the 8-bit characters with special meaning in HTML to their entities

string[256] lutLatin1ToHTML;
lutLatin1ToHTML[0x22] = "&quot;";
lutLatin1ToHTML[0x26] = "&amp;";
...

and to apply it using the function

pure string toHTML(in string src,
                   ref in string[256] lut) {
    return src.map!(a => (lut[a] ? lut[a] : new string(a))).reduce!((a, b) => a ~ b) ;
}

It almost works, except that I don't know how to create a string from a `ubyte` (the no-translation case).

I tried

writeln(new string('a'));

but it prints garbage and I don't know why.

For more details on HTML encoding see https://en.wikipedia.org/wiki/Character_entity_reference

Nordlöw

2 Answers

You can make a string from a ubyte most easily by doing "" ~ b, for example:

ubyte b = 65;
string a = "" ~ b;
writeln(a); // prints A

BTW, if you want to do a lot of html stuff, my dom.d and characterencodings.d might be useful: https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

It has an HTML parser, DOM manipulation functions similar to JavaScript's (e.g. ele.querySelector(), getElementById, ele.innerHTML, ele.innerText, etc.), conversion from a few different character encodings, including Latin-1, and it outputs ASCII-safe HTML with all special and Unicode characters properly encoded.

assert(htmlEntitiesEncode("foo < bar") == "foo &lt; bar");

stuff like that.

Adam D. Ruppe
  • I should probably add that "" ~ 128 won't work - that will probably eventually complain about invalid utf-8 sequence. It won't up front though, so you can build the string one byte at a time. Just make sure you're appending values b < 128 - ascii - or take care to encode the others in proper utf8 format. But if you HTML encode them all, you'll be fine anyway since that's all ascii. – Adam D. Ruppe Sep 23 '13 at 21:45
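
To make the point about bytes of 128 and above concrete, here is a minimal sketch of a Latin-1 to HTML encoder whose output is pure ASCII. The function name latin1ToHtml is hypothetical, and it swaps the question's lookup table for a switch to keep the example short. It relies on Latin-1 byte values being identical to the first 256 Unicode code points, so a numeric character reference can use the byte value directly.

```d
import std.array : appender;
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: encodes Latin-1 bytes as ASCII-only HTML.
string latin1ToHtml(const(ubyte)[] src)
{
    auto result = appender!string();
    foreach (b; src)
    {
        switch (b)
        {
            case 0x22: result.put("&quot;"); break; // "
            case 0x26: result.put("&amp;");  break; // &
            case 0x3C: result.put("&lt;");   break; // <
            case 0x3E: result.put("&gt;");   break; // >
            default:
                if (b < 128)
                {
                    // Plain ASCII is valid UTF-8 as-is, so copy it unchanged.
                    result.put(cast(char) b);
                }
                else
                {
                    // Bytes >= 128 become numeric character references,
                    // which keeps the resulting string ASCII-only.
                    result.put("&#");
                    result.put(to!string(cast(uint) b));
                    result.put(';');
                }
                break;
        }
    }
    return result.data;
}

void main()
{
    // "foo < é" in Latin-1 (0xE9 is é)
    ubyte[] input = [0x66, 0x6F, 0x6F, 0x20, 0x3C, 0x20, 0xE9];
    writeln(latin1ToHtml(input)); // prints foo &lt; &#233;
}
```

Because every byte above 127 is written as a reference, no raw non-ASCII byte is ever appended to the string, so the invalid UTF-8 problem does not arise.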

In this case Adam's solution works just fine, of course. (It takes advantage of the fact that ubyte is implicitly convertible to char, which is then appended to the immutable(char)[] array for which string is an alias.)
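
As for why new string('a') in the question prints garbage, a likely explanation is that new on a dynamic array type takes a length rather than a value: the character 'a' is implicitly converted to the integer 97, and the 97 elements are default-initialized to char.init, which is 0xFF and not valid UTF-8. The sketch below checks that reading; it assumes the expression still compiles on your compiler version, as it evidently did for the asker.

```d
void main()
{
    // 'a' is treated as the array length 97, not as the content.
    auto s = new string('a');
    assert(s.length == 97);

    // Every element holds char.init (0xFF), which is not valid UTF-8,
    // hence the garbage when the string is printed.
    foreach (c; s)
        assert(c == 0xFF);
}
```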

In general the safe way of converting types is to use std.conv.

import std.stdio, std.conv;

void main() {
    // utf-8
    char cc = 'a';
    string s1 = text(cc);
    string s2 = to!string(cc);
    writefln("%c %s %s", cc, s1, s2);

    // utf-16
    wchar wc = 'a';
    wstring s3 = wtext(wc);
    wstring s4 = to!wstring(wc);
    writefln("%c %s %s", wc, s3, s4);    

    // utf-32
    dchar dc = 'a';
    dstring s5 = dtext(dc);
    dstring s6 = to!dstring(dc); 
    writefln("%c %s %s", dc, s5, s6);

    ubyte b = 65;
    // Numeric conversion: a is "65", not "A"; use to!string(cast(char) b) to get "A".
    string a = to!string(b);
} 

NB. text() is actually intended for processing multiple arguments, but is conveniently short.

fwend
  • Indeed, though note that to!string(ubyte) will give a number like "65" rather than A. You could do to!string(cast(char) ubyte), or cast to wchar/dchar too, and that would work. – Adam D. Ruppe Sep 24 '13 at 21:10
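
The difference described in the comment above can be seen in a short sketch. The variable names are only illustrative.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    ubyte b = 65;

    string asNumber = to!string(b);            // formats the integer value
    string asChar   = to!string(cast(char) b); // treats the byte as a character

    writeln(asNumber); // prints 65
    writeln(asChar);   // prints A

    assert(asNumber == "65");
    assert(asChar == "A");
}
```

The cast is only safe for values below 128; larger bytes would produce a lone code unit that is not valid UTF-8.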