2

I have a string like "café" and I need to convert it to "cafe". I tried (string-normalize-nfd "café"), but the result displays as cafe followed by a quotation mark with an accent on it, and (string-normalize-nfd "alguém") displays as alguem with the accent on the m. How can I convert an accented string to an unaccented one?

Óscar López
Vitor F.M.

3 Answers

4

I can't think of a built-in procedure that does what you need, but it's easy to write your own implementation:

; maps accented chars to unaccented chars
(define translate
  '#hash((#\á . #\a)
         (#\é . #\e)
         (#\í . #\i)
         (#\ó . #\o)
         (#\ú . #\u)))

(define (remove-accents str)
  (apply string ; convert char list back into string
         ; for each char: replace it with non-accented
         ; version, if not present leave it unmodified
         (map (λ (c) (hash-ref translate c (const c)))
              (string->list str)))) ; convert string to char list

Be sure to add more mappings as needed, for instance to include uppercase chars, etc. It works as expected:

(remove-accents "café")
=> "cafe"
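If typing out each pair by hand gets tedious, here is a minimal sketch of one way to extend the table, assuming you only need the acute-accented vowels in both cases. The names `accent-table` and `remove-accents/table` are hypothetical, not part of the answer above. Both the table and the input are passed through `string-normalize-nfc` first, so the lookup still works if the text arrives in decomposed form.

```racket
#lang racket/base

;; Hypothetical extension of the lookup-table approach.
;; Normalizing to NFC guarantees each accented letter is a single
;; character, so the two strings line up position by position.
(define accented   (string-normalize-nfc "áéíóúÁÉÍÓÚ"))
(define unaccented "aeiouAEIOU")

;; Immutable hash: accented char -> unaccented char.
(define accent-table
  (for/hash ([from (in-string accented)]
             [to   (in-string unaccented)])
    (values from to)))

(define (remove-accents/table str)
  (list->string
   (for/list ([c (in-string (string-normalize-nfc str))])
     ;; Characters missing from the table are kept unchanged.
     (hash-ref accent-table c c))))

(remove-accents/table "café")   ; "cafe"
(remove-accents/table "Água")   ; "Agua"
```

Anything not listed in `accented` (for example ã or ç) passes through untouched, so the two strings still need to grow with your data.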
Óscar López
3

Your question is not really one about Racket; it's about Unicode normalization. The function that you're referring to performs the "Canonical Normalization" described on this page.

It appears to me that the best way to do what you want might be to perform the normalization and then strip out any combining accent characters. That is safe as long as you know that the original string doesn't already contain standalone accent characters that you would want to keep.
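A minimal sketch of that idea, assuming "accent characters" means code points in the Unicode general category Mn (nonspacing mark). The helper name `strip-combining-marks` is hypothetical; `string-normalize-nfd` and `char-general-category` are standard Racket procedures.

```racket
#lang racket/base

;; Hypothetical helper: decompose with NFD, then drop every character
;; whose general category is 'mn (nonspacing combining mark).
(define (strip-combining-marks s)
  (list->string
   (for/list ([c (in-string (string-normalize-nfd s))]
              #:unless (eq? (char-general-category c) 'mn))
     c)))

(strip-combining-marks "café")   ; "cafe"
(strip-combining-marks "alguém") ; "alguem"
```

Letters that have no canonical decomposition, such as ß or ø, are left as they are, so the result is not guaranteed to be pure ASCII.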

John Clements
  • The Racket [`string-normalize-{nfc nfd nfkc nfkd}`](https://docs.racket-lang.org/reference/strings.html#%28def._%28%28quote._~23~25kernel%29._string-normalize-nfd%29%29) functions do what's described on that page. – Greg Hendershott Aug 04 '18 at 17:57
3

You have the right idea to use string-normalize-nfd, and it's actually working! It's just that the composed and decomposed forms of a string look the same when printed.

(string-normalize-nfd "café") ;Racket prints the decomposed string as "café"

You can see that it worked, if you convert the string to bytes:

(string->bytes/utf-8 (string-normalize-nfd "café")) ;#"cafe\314\201"
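You can see the same thing without going through bytes by looking at the code points directly. This is only an illustrative sketch; the input is written with a `\u` escape so there is no doubt that it starts out as a single precomposed character.

```racket
#lang racket/base

(define composed "caf\u00E9")                        ; é as one code point, U+00E9
(define decomposed (string-normalize-nfd composed))  ; e followed by U+0301

(string-length composed)                             ; 4
(string-length decomposed)                           ; 5
(map char->integer (string->list decomposed))        ; '(99 97 102 101 769)
(equal? composed decomposed)                         ; #f
(equal? composed (string-normalize-nfc decomposed))  ; #t
```

The trailing 769 is U+0301, the combining acute accent, which is the character that has to be dropped.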

Given that, here's a rough cut at a function. I'd be surprised if this were exactly correct for all cases. But hopefully it's enough to get you on your way and you can refine it.

(define (ascii-ize s)
  (list->string
   (for/list ([b (in-bytes (string->bytes/utf-8
                            (string-normalize-nfd s)))]
              #:when (< b 128))
     (integer->char b))))

(ascii-ize "café")   ;"cafe"
(ascii-ize "alguém") ;"alguem"
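One case where this rough cut does something you may not expect: a letter with no canonical decomposition is made up entirely of bytes above 127 in UTF-8, so the whole letter is dropped rather than replaced. A short sketch, with the definition repeated only so the block runs on its own:

```racket
#lang racket/base

;; Same definition as above, repeated so this block is self-contained.
(define (ascii-ize s)
  (list->string
   (for/list ([b (in-bytes (string->bytes/utf-8
                            (string-normalize-nfd s)))]
              #:when (< b 128))
     (integer->char b))))

;; ß (U+00DF) and Ø (U+00D8) do not decompose under NFD,
;; so both of their UTF-8 bytes are filtered out.
(ascii-ize "stra\u00DFe") ; "strae"
(ascii-ize "\u00D8re")    ; "re"
```

If those letters matter for your data, you would need an explicit mapping for them before or after the normalization step.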
Greg Hendershott