
As I dig further into the use of DCGs, I find myself stymied by wide characters.

I am trying to write a (more or less) general purpose tokenizer, and am testing its mettle against this text file of Macbeth (which I came across in a recent /r/dailyprogrammer challenge). Hidden here and there amongst the text is the wide character Ã.

For a long while I was at a loss, and no matter what tweaks I tried, my tokenizer merely replied "false". Of course, I finally caught on: because my DCG rules attempt to tokenize their data by appeals to code_type/2, delimiting "words" as contiguous chars of type csym, separating out punctuation with char_type(C, punct), etc., they fail when they encounter Ã, which is represented as the two codes [195, 131].
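To make the byte-versus-code-point distinction concrete, here is a minimal sketch, assuming SWI-Prolog and its library(utf8). The helper name bytes_codepoints/2 is hypothetical, introduced only for illustration. It decodes a list of UTF-8 bytes into Unicode code points, so the pair [195, 131] collapses into a single code for Ã:

```prolog
:- use_module(library(utf8)).   % provides utf8_codes//1

% bytes_codepoints(+Bytes, -Codes)
% Hypothetical helper: decode a list of UTF-8 bytes into a list of
% Unicode code points, so each multi-byte character becomes one code.
bytes_codepoints(Bytes, Codes) :-
    phrase(utf8_codes(Codes), Bytes).

% Example query:
% ?- bytes_codepoints([195, 131], Cs).
```

Since 195 and 131 are the UTF-8 bytes 0xC3 0x83, the query should bind Cs to the one-element list holding code point U+00C3. Whether code_type/2 then classifies that code as csym is a separate matter that may depend on the locale.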

Having identified the problem, I am at a loss as to how to deal with these code sequences cleanly. Ideally, I'd like to group all graphic characters, of whatever width, as parts of "words", unless they are explicit punctuation symbols. I have tried reading the file under various different encodings, and that doesn't seem to help, presumably because I am still relying on code_type/2.

For the moment, I've contrived the following unsavory solution as a catchall for any non-ascii character:

% non ascii's
nasciis([C])     --> nascii(C), (call(eos), !).
nasciis([C]),[D] --> nascii(C), [D], {D < 127}.
nasciis([C|Cs])  --> nascii(C), nasciis(Cs).

nascii(C)        --> [C], {C > 127}.

But I'm sure, or at least I'd hope, there is a better way to approach this problem.

How do others deal with this kind of scenario? Is there a standard approach? Am I overlooking something simple?

Shon
    Have you read this: http://www.swi-prolog.org/pldoc/man?section=widechars ? Anyway, what is the purpose of having these Ã characters there? Seems like a problem with the encoding to start with? In this sense, I guess it's fine that you get "false" :) It is a whole other question how to gracefully deal with dirty input. –  Mar 05 '15 at 10:33
    PS. I am quite certain that this weird character is there as a consequence of a messed up encoding at some point of time. –  Mar 05 '15 at 10:40
    PPS Keep in mind that the character properties actually depend on your locale! For me, for example, `?- code_type(0'Ã, csym).` says true. –  Mar 05 '15 at 11:03
  • Thanks, @Boris! I had looked at the SWI docs on wide character support. I subsequently tried different encodings while reading in my data, and while this did result in... well, different encodings... it didn't help with parsing wide characters. I think you're absolutely right that in this case the `Ã` is the result of an encoding error (so, damaged data), but if I were trying to parse, e.g., twitter streams, I would have tons of wide characters. It would be nice to have a way of dealing with these other than my method. I'd also like something that isn't limited by locale! Thanks for your input :) – Shon Mar 07 '15 at 21:29

0 Answers