
I have code that runs without error when executed from the SLIME prompt inside Emacs. If I start SBCL from the shell prompt instead, I get this error:

* (ei:proc-file "BRAvESP000.log" "lixo")

debugger invoked on a SB-INT:STREAM-ENCODING-ERROR:
  :UTF-8 stream encoding error on
  #<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo"
    {10049E8FF3}>:

    the character with code 55357 cannot be encoded.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [OUTPUT-NOTHING    ] Skip output of this character.
  1: [OUTPUT-REPLACEMENT] Output replacement string.
  2: [ABORT             ] Exit debugger, returning to top level.

(SB-IMPL::STREAM-ENCODING-ERROR-AND-HANDLE #<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo" {10049E8FF3}> 55357)
0]

The mystery is that in both cases I am using the same SBCL 1.1.8 on the same machine, running Mac OS 10.8.4. Any ideas?

The code:

(defun proc-file (filein fileout &key (fn-convert #'identity))
  (with-open-file (fout fileout
                   :direction :output
                   :if-exists :supersede
                   :external-format :utf8)
    (with-open-file (fin filein :external-format :utf8)
      (loop for line = (read-line fin nil)
            while line
            do (handler-case
                   (let* ((line (ppcre:regex-replace "^.*{jsonTweet=" line "{\"jsonTweet\":"))
                          (data (gethash "jsonTweet" (yason:parse line))))
                     (yason:encode (funcall fn-convert (yason:parse data)) fout)
                     (format fout "~%"))
                 (end-of-file ()
                   (format *standard-output* "Error[~a]: ~a~%" filein line)))))))
Rainer Joswig
Alexandre Rademaker
    I suggest that you start by assuming this isn't a yason problem -- we'll find out quickly if it is -- and add the following to your code: `(format *standard-output* "~&~{~x~^ ~}" (map 'list 'char-code line))`. Is the final line in the failing case the same as the corresponding line in the SLIME environment? – Nick Levine Jul 17 '13 at 10:17
    Maybe there are files named BRAvESP000.log in more than one directory, and the current directory is different if you're in SLIME or if you're launching SBCL manually. Try absolute paths. – acelent Jul 19 '13 at 12:49
  • If the character code is not a mistake, it belongs to the Unicode range of surrogate pairs. These aren't characters of UTF-8 encoding, they are reserved for use with UTF-16. Here's my guess: there's a modern tradition in web design to use private plane characters together with a special font to serve as icons (such as various arrows, bullets and so on). Twitter, in particular, does that (so does Github for example too). This is a way for an HTML page to save on loading images (as these are vector outlines from a special font). I would imagine that Emacs deals with them before sending. –  Oct 17 '13 at 21:28
  • But SBCL on its own doesn't. I think you would be fine if you simply delete those, or replace with something less offensive. –  Oct 17 '13 at 21:29

1 Answer


This is almost certainly a bug in yason. JSON requires that a non-BMP character, if escaped, be written as a surrogate pair, and a parser is expected to combine that pair back into a single character. Here's a simple example with U+10000, which may optionally be escaped in JSON as "\ud800\udc00" (I use babel below because its conversion is less stringent):

(map 'list #'char-code (yason:parse "\"\\ud800\\udc00\"")) 
  => (55296 56320)

Code point 55296 (decimal, #xD800) is a high surrogate, the first half of a surrogate pair. Surrogates should only ever appear in pairs inside UTF-16 data, never as standalone characters in a string. Fortunately this is easy to work around by using babel to encode the string to UTF-16 and back again:

(babel:octets-to-string (babel:string-to-octets (yason:parse "\"\\ud800\\udc00\"") :encoding :utf-16le) :encoding :utf-16le)
  => "𐀀"
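
For reference, the arithmetic behind a surrogate pair can be sketched in plain Common Lisp. The helper names below are hypothetical and are not part of yason or babel; this only illustrates how the two code units from the example map to one code point, and that the code 55357 from the error message falls in the high-surrogate range.

```lisp
;; Hypothetical helpers for illustration; not part of yason or babel.
(defun high-surrogate-p (code)
  "True if CODE lies in the high (leading) surrogate range."
  (<= #xD800 code #xDBFF))

(defun low-surrogate-p (code)
  "True if CODE lies in the low (trailing) surrogate range."
  (<= #xDC00 code #xDFFF))

(defun decode-surrogate-pair (high low)
  "Combine a high and a low surrogate code unit into one code point."
  (assert (and (high-surrogate-p high) (low-surrogate-p low)))
  (+ #x10000
     (ash (- high #xD800) 10)   ; top 10 bits come from the high surrogate
     (- low #xDC00)))           ; bottom 10 bits come from the low surrogate

(high-surrogate-p 55357)             ; => T
(decode-surrogate-pair 55296 56320)  ; => 65536, i.e. #x10000
```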

You should be able to work around this by changing this line:

(yason:encode (funcall fn-convert (yason:parse data)) fout)

to use an intermediate string, which you convert to UTF-16 and back:

(write-sequence
 (babel:octets-to-string
  (babel:string-to-octets
   (with-output-to-string (outs)
    (yason:encode (funcall fn-convert (yason:parse data)) outs))
   :encoding :utf-16le)
  :encoding :utf-16le)
 fout)
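
If you would rather not depend on babel for this, a similar repair can be sketched in plain Common Lisp. This is only a sketch under assumptions: `merge-surrogates` is a hypothetical name, and it assumes an implementation such as SBCL where `code-char` returns a character for surrogate code points. It is not what the yason patch does.

```lisp
;; Hypothetical helper; assumes CODE-CHAR accepts surrogate code points (as in SBCL).
(defun merge-surrogates (string)
  "Return a copy of STRING in which every high/low surrogate pair is
replaced by the single character it encodes. Lone surrogates are kept."
  (with-output-to-string (out)
    (loop with n = (length string)
          with i = 0
          while (< i n)
          do (let ((hi (char-code (char string i)))
                   (lo (and (< (1+ i) n)
                            (char-code (char string (1+ i))))))
               (if (and lo
                        (<= #xD800 hi #xDBFF)
                        (<= #xDC00 lo #xDFFF))
                   ;; A valid pair: emit the combined character, skip both units.
                   (progn
                     (write-char (code-char (+ #x10000
                                               (ash (- hi #xD800) 10)
                                               (- lo #xDC00)))
                                 out)
                     (incf i 2))
                   ;; Anything else is copied through unchanged.
                   (progn
                     (write-char (char string i) out)
                     (incf i)))))))
```

Applied to the intermediate string produced by `with-output-to-string` above, this should play the same role as the UTF-16 round trip, though I have only reasoned through it for the simple pair case.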

I submitted a patch that has been accepted to fix this in yason:

https://github.com/hanshuebner/yason/commit/4a9bdaae652b7ceea79984e0349a992a5458a0dc

Jason