Converting Text to HTML In D

Question

I'm trying to figure out the best way of encoding text (either 8-bit ubyte[] or string) to its HTML counterpart.

My proposal so far is to use a lookup table that maps the 8-bit characters that have special meaning in HTML to their entities

string[256] lutLatin1ToHTML;
lutLatin1ToHTML[0x22] = "&quot;";
lutLatin1ToHTML[0x26] = "&amp;";
...

and to apply it using the function

pure string toHTML(in string src,
                   ref in string[256] lut) {
    return src.map!(a => (lut[a] ? lut[a] : new string(a))).reduce!((a, b) => a ~ b) ;
}

It almost works, except that I don't know how to create a string from a ubyte (the no-translation case).

I tried

writeln(new string('a'));

but it prints garbage and I don't know why.

For more details on HTML encoding see https://en.wikipedia.org/wiki/Character_entity_reference

score 2 · Accepted Answer · answered Sep 23 '13 at 21:41

You can make a string from a ubyte most easily by doing "" ~ b, for example:

ubyte b = 65;
string a = "" ~ b;
writeln(a); // prints A
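
As for why new string('a') prints garbage: as far as I can tell, new string(n) allocates an array of n code units rather than wrapping its argument. So 'a' is converted to the integer 97 and you get 97 default-initialized chars (char.init is 0xFF, which is not valid UTF-8). A minimal sketch contrasting the two; the function names garbage and fromChar are made up for illustration:

```d
// Sketch: `new string(x)` treats x as the array length; it does not wrap x.
string garbage()
{
    // 'a' converts to the integer 97, so this is 97 chars of char.init (0xFF).
    return new string('a');
}

// The idiom from this answer: append the code unit to an empty string.
string fromChar(char c)
{
    return "" ~ c;
}
```

Here garbage() returns a 97-element string of 0xFF code units, while fromChar('a') returns the one-character string "a".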

BTW, if you want to do a lot of html stuff, my dom.d and characterencodings.d might be useful: https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

It has an HTML parser, DOM manipulation functions similar to JavaScript (e.g. ele.querySelector(), getElementById, ele.innerHTML, ele.innerText, etc.), conversion from a few different character encodings, including Latin-1, and it outputs ASCII-safe HTML with all special and Unicode characters properly encoded.

assert(htmlEntitiesEncode("foo < bar") == "foo &lt; bar");

stuff like that.

I should probably add that "" ~ 128 won't work - that will probably eventually complain about invalid utf-8 sequence. It won't up front though, so you can build the string one byte at a time. Just make sure you're appending values b < 128 - ascii - or take care to encode the others in proper utf8 format. But if you HTML encode them all, you'll be fine anyway since that's all ascii. — Adam D. Ruppe, Sep 23 '13 at 21:45
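
A minimal sketch of that last point, assuming the input is Latin-1, so that each byte value equals its Unicode code point. latin1ToHTML is a hypothetical helper, not a function from dom.d; it escapes the four markup characters and writes every byte >= 128 as a numeric character reference, so the output stays pure ASCII:

```d
import std.conv : to;

// Hypothetical helper (not part of dom.d): Latin-1 bytes in, ASCII-only HTML out.
string latin1ToHTML(const(ubyte)[] src)
{
    string result;
    foreach (b; src)
    {
        switch (b)
        {
            case 0x22: result ~= "&quot;"; break; // "
            case 0x26: result ~= "&amp;";  break; // &
            case 0x3C: result ~= "&lt;";   break; // <
            case 0x3E: result ~= "&gt;";   break; // >
            default:
                if (b < 128)
                    result ~= cast(char) b;               // plain ASCII, copied as-is
                else
                    result ~= "&#" ~ to!string(b) ~ ";";  // numeric reference, e.g. &#233;
                break;
        }
    }
    return result;
}
```

For example, the bytes [0x41, 0xE9] come out as "A&#233;". Appending to a string in a loop is not the fastest approach, but it keeps the sketch short.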

fwend · Answer 2 · 2013-09-25T14:01:38.467

In this case Adam's solution works just fine, of course. (It takes advantage of the fact that ubyte is implicitly convertible to char, which is then appended to the immutable(char)[] array for which string is an alias.)

In general the safe way of converting types is to use std.conv.

import std.stdio, std.conv;

void main() {
    // utf-8
    char cc = 'a';
    string s1 = text(cc);
    string s2 = to!string(cc);
    writefln("%c %s %s", cc, s1, s2);

    // utf-16
    wchar wc = 'a';
    wstring s3 = wtext(wc);
    wstring s4 = to!wstring(wc);
    writefln("%c %s %s", wc, s3, s4);    

    // utf-32
    dchar dc = 'a';
    dstring s5 = dtext(dc);
    dstring s6 = to!dstring(dc); 
    writefln("%c %s %s", dc, s5, s6);

    ubyte b = 65;
    string a = to!string(b); // note: yields "65" (the number), not "A"
}

NB. text() is actually intended for processing multiple arguments, but is conveniently short.

Indeed, though note that to!string(ubyte) will give a number like "65" rather than A. You could do to!string(cast(char) ubyte), or cast to wchar/dchar too, and that would work. — Adam D. Ruppe, Sep 24 '13 at 21:10
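
To make the difference concrete, here is a small sketch. asciiToString is a hypothetical name, and the assert restricts it to ASCII because a byte >= 128 cast to char would not be valid UTF-8 on its own:

```d
import std.conv : to;

// Hypothetical helper: turn one ASCII byte into a one-character string.
string asciiToString(ubyte b)
{
    assert(b < 128, "only ASCII bytes form a valid one-unit UTF-8 string");
    // The cast makes std.conv treat the value as a character, not a number.
    return to!string(cast(char) b);
}
```

asciiToString(65) gives "A", whereas to!string applied directly to the ubyte 65 gives "65".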
