I am trying to detect file encoding using LispWorks.

LispWorks should be capable of this; see External Formats and File Streams in the LispWorks documentation.

[Note: details based on @rainer-joswig and @svante comments]

system:*file-encoding-detection-algorithm* is set to its default:

(setf system:*file-encoding-detection-algorithm*
      '(find-filename-pattern-encoding-match
        find-encoding-option
        detect-utf32-bom
        detect-unicode-bom
        detect-utf8-bom
        specific-valid-file-encoding
        locale-file-encoding))

And also,

;; Specify the correct characters
(lw:set-default-character-element-type 'cl:character)

I tested with a few verifiable files; source URLs are noted in the code comments where I have them.

UNICODE and LATIN-1 are properly detected:

;; UNICODE
;; http://www.humancomp.org/unichtm/tongtwst.htm
(with-open-file (ss "/tmp/tongtwst.htm")
  (stream-external-format ss))
;; => (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)

;; LATIN-1
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)

Detecting UTF-8 does not work out of the box:

;; UTF-8 encoding
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)

Adding UTF-8 to *specific-valid-file-encodings* makes it work:

(pushnew :utf-8 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:UTF-8)

;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :CRLF)

But now the same LATIN-1 file as above is detected as UTF-8:

(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :LF)

So I pushed LATIN-1 to *specific-valid-file-encodings* as well:

(pushnew :latin-1 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:LATIN-1 :UTF-8)

;; This one works again
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)

;; But this one, which was properly detected as `UTF-8`,
;; is now detected as `LATIN-1`, *which is wrong.*
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)

What am I doing wrong?

How can I correctly detect file encoding using LispWorks?

gsl
  • I use: `(pushnew :utf-8 system:*specific-valid-file-encodings*)` – Rainer Joswig Aug 17 '19 at 16:17
  • Doing that I get `(with-open-file (ss "/lisp/test-utf8.log") (stream-external-format ss))` `Error: External format (:UTF-8 :EOL-STYLE :CRLF) produces characters of type SIMPLE-CHAR, which is not a subtype of the specified element-type BASE-CHAR.` – gsl Aug 17 '19 at 16:42
  • right, you also need to specify the correct characters. You can do that in WITH-OPEN-FILE and `:element-type`, IIRC. `cl:character` should work. I set it system wide in my `.lispworks` file: `(lw:set-default-character-element-type 'cl:character)`. – Rainer Joswig Aug 17 '19 at 16:47
  • Hmm, you want to experiment a bit with the settings and the function `system:guess-external-format`. Since I don't use windows, I don't see the same effects. A question on the LispWorks mailing list might generate some answers... – Rainer Joswig Aug 17 '19 at 17:01
  • This may be obvious, but there need to be some characters with a code > 127 in your file for such a heuristic to work. – Svante Aug 19 '19 at 16:03
  • @RainerJoswig @ svante: Thank you for valuable advice. I have added details based on your comments. Some files are detected fine, some are not. Please see question. – gsl Aug 19 '19 at 17:48
  • The `.ucm` file is an ASCII file. ASCII is a subset of both UTF-8 and Latin-1, so whatever gets checked first is chosen. – Svante Aug 21 '19 at 07:54
  • Thank you, that makes sense. Would you like to make a short answer, so I can accept it as best answer? Or I could do it, quoting your comment, if you wish. – gsl Aug 21 '19 at 13:22
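
The point raised in the comments (a pure-ASCII file is valid as both UTF-8 and Latin-1, so the first encoding checked wins) can be verified by looking at the raw bytes. Below is a minimal sketch in portable Common Lisp. It is not the LispWorks detection mechanism and does not call any LispWorks API; the function names `read-file-octets`, `ascii-only-p`, `valid-utf-8-p` and `guess-encoding` are hypothetical, made up for this illustration. The UTF-8 check is deliberately simplified (it only checks lead and continuation byte structure), and falling back to Latin-1 for everything else is an assumption, since any byte sequence is valid Latin-1.

```lisp
;; Portable Common Lisp sketch; all function names here are hypothetical.

(defun read-file-octets (path)
  "Return the contents of PATH as a vector of (unsigned-byte 8)."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun ascii-only-p (octets)
  "True when every octet is below 128, i.e. the data is plain ASCII."
  (every (lambda (octet) (< octet 128)) octets))

(defun valid-utf-8-p (octets)
  "Simplified structural check: each lead byte must be followed by the
right number of continuation bytes (#x80..#xBF). Overlong forms and
surrogate code points are NOT rejected by this sketch."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation bytes implied by the lead byte.
                    (extra (cond ((< lead #x80) 0)
                                 ((<= #xC2 lead #xDF) 1)
                                 ((<= #xE0 lead #xEF) 2)
                                 ((<= #xF0 lead #xF4) 3)
                                 (t (return-from valid-utf-8-p nil)))))
               ;; Sequence truncated at the end of the data.
               (when (>= (+ i extra) n)
                 (return-from valid-utf-8-p nil))
               (loop for k from 1 to extra
                     unless (<= #x80 (aref octets (+ i k)) #xBF)
                       do (return-from valid-utf-8-p nil))
               (incf i (1+ extra))))
    t))

(defun guess-encoding (octets)
  "Return :ASCII when the data is ambiguous (valid as UTF-8 and Latin-1),
:UTF-8 when it has non-ASCII bytes that form valid UTF-8 sequences,
and :LATIN-1 otherwise (an assumed fallback)."
  (cond ((ascii-only-p octets) :ascii)
        ((valid-utf-8-p octets) :utf-8)
        (t :latin-1)))
```

Calling `(guess-encoding (read-file-octets "/tmp/windows-1252-2000.ucm"))` on a file that contains only ASCII bytes would return `:ASCII`, which would explain why the order of `*specific-valid-file-encodings*` decides the result for that file rather than its content.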

0 Answers