I'm trying to create a hash table in Common Lisp that uses characters as keys, but it doesn't work with accented characters: it only ever stores one accented key.

In this example I add 5 keys and the hash table reports 5 elements. I then add 5 accented keys and it reports 6. Finally I add 5 more “normal” keys and the count goes to 11, as expected.

What is happening? And how can I solve this?

(defparameter *h* (make-hash-table))
(setf (gethash #\A *h*) #\A)
(setf (gethash #\E *h*) #\A)
(setf (gethash #\I *h*) #\A)
(setf (gethash #\O *h*) #\A)
(setf (gethash #\U *h*) #\A)
(hash-table-count *h*)
(setf (gethash #\á *h*) #\A)
(setf (gethash #\é *h*) #\A)
(setf (gethash #\í *h*) #\A)
(setf (gethash #\ó *h*) #\A)
(setf (gethash #\ú *h*) #\A)
(hash-table-count *h*)
(setf (gethash #\a *h*) #\A)
(setf (gethash #\e *h*) #\A)
(setf (gethash #\i *h*) #\A)
(setf (gethash #\o *h*) #\A)
(setf (gethash #\u *h*) #\A)
(hash-table-count *h*)
Manuel
  • Which implementation are you using? Have you tried to use another `test` function when defining `*h*`? – Martin Buchmann Jun 30 '19 at 20:03
  • SBCL, and I don't understand what you mean with defining another “test” function. – Manuel Jun 30 '19 at 20:06
  • You can provide an optional `test` keyword to `make-hash-table` which determines which function is used to test for equality of two hash keys. Check its documentation in hyperspec. – Martin Buchmann Jun 30 '19 at 20:09
  • Yes, it doesn't work as expected with `eq`, `eql`, `equal`, or `equalp` but then I “reduced” the question to another, because `(eq #\É #\Á)` outputs `T`. – Manuel Jun 30 '19 at 20:18
  • I cannot check it myself at the moment, but my first question was regarding your CL implementation. Check its documentation to see whether it uses UTF-8 right away or needs some extra configuration. – Martin Buchmann Jun 30 '19 at 20:21
  • That might be on the right track. If I run sbcl from the terminal and evaluate `(eq #\É #\Á)` it says `NIL` (I have the `(setf sb-impl::*default-external-format* :utf-8)` line in `~/.sbclrc`), but if I do it from SublimeREPL in Sublime Text it says `T`. Thank you. Since this question may turn out to be unanswerable, I might delete it. – Manuel Jun 30 '19 at 20:27
  • The default test for hashtables is EQL. You also need to make sure that Lisp uses the right encoding when reading accented characters... – Rainer Joswig Jun 30 '19 at 22:10
  • Just use `char=` for your test. – Spenser Truex Jun 30 '19 at 22:40
  • `eq` is for testing if the operands *are the same object*, so whether or not it returns `T` for two constant characters is a complete gamble. `eql` is guaranteed to be `T` "if [the operands] are both characters that represent the same character." http://www.lispworks.com/documentation/HyperSpec/Body/f_eql.htm – Spenser Truex Jul 01 '19 at 01:34
  • If you want your code to be portable you *can't* use `char=` for the test function (see [CLHS](http://www.lispworks.com/documentation/HyperSpec/Body/f_mk_has.htm) for the functions you can use). –  Jul 01 '19 at 11:26
  • I would say that this is a bug in SBCL. If the external format is such that `#\É` and `#\Á` are not decoded properly, then the reader should diagnose that. Basically, what seems to be going on here is that raw UTF-8 follows the backslash, and the backslash syntax just takes the first byte (which is `#xC3` for both characters), failing to diagnose the trailing junk. – Kaz Jul 01 '19 at 19:06
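
Kaz's byte-level explanation can be checked outside Lisp. The following is a minimal shell sketch, assuming a POSIX `printf`, `od`, and `tr`; the helper name `utf8_hex` is made up for this illustration. The UTF-8 encodings are written as octal escapes so the result does not depend on the script's own file encoding.

```shell
# Print the bytes produced by the printf format string $1 as lowercase hex.
# utf8_hex is a hypothetical helper, not part of any tool mentioned above.
utf8_hex() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
}

A_ACUTE=$(utf8_hex '\303\201')   # UTF-8 encoding of U+00C1 (capital A with acute)
E_ACUTE=$(utf8_hex '\303\211')   # UTF-8 encoding of U+00C9 (capital E with acute)

echo "A-acute: $A_ACUTE"   # c381
echo "E-acute: $E_ACUTE"   # c389
```

Both encodings start with the byte `c3`. If, as Kaz suggests, the reader decodes the input as Latin-1 and keeps only the first byte after `#\`, both literals would read as the same character, which matches the `(eq #\É #\Á)` result reported above. The lowercase á, é, í, ó, ú encode as `c3 a1`, `c3 a9`, `c3 ad`, `c3 b3`, `c3 ba`, so they share that first byte too, which is consistent with the count going from 5 to 6.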

2 Answers

From the SBCL manual:

On non-Unicode builds, the default external format is :latin-1.

You want to use UTF-8. So do what the manual says, and set your environment up when you call SBCL:

$ LANG=C.UTF-8 sbcl --noinform --no-userinit --eval "(print (map 'string #'code-char (list 97 98 246)))" --quit
"abö"
$ LANG=C sbcl --noinform --no-userinit --eval "(print (map 'string #'code-char (list 97 98 246)))" --quit
"ab?"

If you use SLIME or Sly from Emacs, there is a way to set it up in your init:

(setq sly-lisp-implementations
      '((sbcl ("/opt/sbcl/bin/sbcl") :coding-system utf-8-unix)))

Then use a sane test function, like char=. You should use the most specific predicate whenever possible, in my opinion. char-equal is the case-insensitive version.

See the Sly manual; the above snippet also works in SLIME, where the variable is named slime-lisp-implementations.

As noted in the comment by @Manuel, if your LANG variable and friends do not use UTF-8, then you are doomed. See this question
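
To see which of "LANG and friends" actually takes effect, recall that POSIX gives `LC_ALL` priority over `LC_CTYPE`, which in turn has priority over `LANG`. The sketch below applies that precedence by hand. The function names `effective_ctype` and `uses_utf8` are made up for this illustration, and how SBCL itself derives its external format from the locale is not shown here.

```shell
# Print the locale name that governs character handling,
# following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        printf '%s\n' "${LANG:-}"
    fi
}

# Succeed if the effective locale name mentions UTF-8 (or utf8).
uses_utf8() {
    case "$(effective_ctype)" in
        *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) return 0 ;;
        *) return 1 ;;
    esac
}

if uses_utf8; then
    echo "character locale looks like UTF-8: $(effective_ctype)"
else
    echo "character locale is not UTF-8: $(effective_ctype)"
fi
```

One consequence of this precedence is that setting only `LANG` has no effect on character handling when `LC_CTYPE` or `LC_ALL` is already set in the environment.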

Spenser Truex
  • Thanks. My problem is then with SublimeREPL. I don't know how to make it use UTF-8 when it starts. Is there a persistent way of writing `LANG=C.UTF-8` so that it's “read” when executing `sbcl`? I cannot tell SublimeREPL to “execute” `LANG=C.UTF-8 sbcl`; it says _FileNotFoundError(2, "No such file or directory: 'LANG=C.UTF-8 sbcl'")_. Thanks – Manuel Jul 01 '19 at 05:53
  • You can write a "start-sbcl.sh" executable script that does the right thing, and point SublimeREPL to that script. – coredump Jul 01 '19 at 07:41
  • Would `eql`, the default test function, be the same as using `char=`? `char=` isn't a valid test function for `make-hash-table`, so if it isn't the same, then `equal` or `equalp` (case insensitive) is the right choice. – Sylwester Jul 01 '19 at 12:26
  • @coredump I will try (although if it's not straightforward I don't know if I will be able to do it), but that might be what I need. Thanks – Manuel Jul 01 '19 at 13:19
  • Alright, I solved it. What I needed was what @coredump said: make my own script. But instead of `LANG=...`, what I needed was `LC_CTYPE=es_ES.UTF-8` (taken from [here](https://stackoverflow.com/a/22823923/1834416)); the version by Spenser Truex did not work. I still don't understand why, but it works. – Manuel Jul 01 '19 at 13:33
  • @SpenserTruex if you edit the answer mentioning `LC_CTYPE` I can mark this answer as valid ;) – Manuel Jul 01 '19 at 13:55
  • It seems that you need UTF-8 all the way down. – Spenser Truex Jul 02 '19 at 19:56
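
coredump's wrapper-script suggestion, combined with the `LC_CTYPE` setting Manuel reports, might look like the sketch below. The file name `start-sbcl.sh` comes from the comment; the function name `run_with_utf8` and the `SBCL_LOCALE` override variable are made up for this illustration, and `es_ES.UTF-8` is simply the locale from Manuel's comment, which must be installed on the machine.

```shell
#!/bin/sh
# start-sbcl.sh: run a command with a UTF-8 character locale.

# Locale to use; SBCL_LOCALE is a hypothetical override, not an SBCL setting.
UTF8_LOCALE="${SBCL_LOCALE:-es_ES.UTF-8}"

run_with_utf8() {
    # env sets LC_CTYPE for the child process only, leaving this shell untouched.
    env LC_CTYPE="$UTF8_LOCALE" "$@"
}

# Demonstration: the child process sees the UTF-8 locale.
run_with_utf8 sh -c 'echo "child LC_CTYPE=$LC_CTYPE"'

# In the real wrapper the last line would start SBCL instead, for example:
#   run_with_utf8 sbcl "$@"
```

After making the script executable with `chmod +x start-sbcl.sh`, SublimeREPL can be pointed at it instead of at `sbcl`. Note that an `LC_ALL` already present in the environment would still take precedence over `LC_CTYPE`.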

If, for whatever reason, you cannot change SBCL's default external format, you can always use #\LATIN_SMALL_LETTER_A_WITH_ACUTE, etc.

peter.cntr
  • Yes, I know this. The thing is that SublimeREPL doesn't behave as expected and doesn't use the utf-8 external format, and I don't know how to do it. – Manuel Jul 01 '19 at 05:53