As I dig further into the use of DCGs, I find myself stymied by wide characters.
I am trying to write a (more or less) general-purpose tokenizer, and am testing its mettle against this text file of Macbeth (which I came across in a recent /r/dailyprogrammer challenge). Hidden here and there amongst the text is the wide character Ã.
For a long while I was at a loss: no matter what tweaks I tried, my tokenizer merely replied "false". I finally caught on: because my DCG rules attempt to tokenize their data by appeals to code_type/2, delimiting "words" as contiguous chars of type csym, separating out punctuation with char_type(C, punct), etc., they fail when they encounter Ã, which shows up in the code list as the two-byte UTF-8 sequence [195, 131].
Having identified the problem, I am at a loss as to how to deal with these code sequences cleanly. Ideally, I'd like to group all graphic characters, of whatever width, as parts of "words", unless they are explicit punctuation symbols. I have tried reading the file under various different encodings, but that doesn't seem to help, presumably because I am still relying on code_type/2.
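For context, this is roughly how I'm reading the file (a minimal sketch, assuming SWI-Prolog; tokens//1 stands in for my actual DCG entry point, and read_stream_to_codes/2 is from library(readutil)):

    % Read the whole file as a code list, then hand it to the DCG.
    tokenize_file(File, Tokens) :-
        setup_call_cleanup(
            open(File, read, In, [encoding(utf8)]),
            read_stream_to_codes(In, Codes),
            close(In)),
        phrase(tokens(Tokens), Codes).

Swapping the encoding(utf8) option for other encodings is the part I've been experimenting with, without success so far.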
For the moment, I've contrived the following unsavory solution as a catchall for any non-ascii character:
% catch-all for runs of non-ASCII codes
eos([], []).                                     % end-of-input test for call//1
nasciis([C])     --> nascii(C), (call(eos), !).
nasciis([C]),[D] --> nascii(C), [D], {D < 127}.  % pushback: D stays in the stream
nasciis([C|Cs]) --> nascii(C), nasciis(Cs).
nascii(C) --> [C], {C > 127}.
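To illustrate what it does, a quick query at the top level (a sketch; it assumes eos//0 is defined as eos([], []), or comes from library(dcg/basics)):

    % Peel a run of non-ASCII codes off the front of a code list:
    ?- phrase(nasciis(Cs), [195, 131, 0'a, 0'b], Rest).
    % Cs = [195, 131], Rest = [97, 98].

So it does keep the two bytes of Ã together, but only by treating every code above 127 as opaque, which is exactly the part that feels wrong.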
But I'm sure, or at least I'd hope, there is a better way to approach this problem.
How do others deal with this kind of scenario? Is there a standard approach? Am I overlooking something simple?