isLetter with accented characters in C

Question

I'd like to create (or find) a C function to check if a char c is a letter... I can do this for a-z and A-Z easily of course.

However, I get an error when testing c == á, ã, ô, ç, ë, etc.

Probably those special characters are stored in more than one char...

I'd like to know how these special characters are stored, which arguments my function needs to receive, and how to implement it. I'd also like to know whether there is a standard function that already does this.

score 4 · Answer 1 · answered Apr 09 '11 at 12:08

I think you're looking for the iswalpha() routine:

   #include <wctype.h>

   int iswalpha(wint_t wc);

DESCRIPTION
   The iswalpha() function is the wide-character equivalent of
   the isalpha(3) function.  It tests whether wc is a wide
   character belonging to the wide-character class "alpha".

It does depend upon the LC_CTYPE of the current locale(7), so its use in a program that is supposed to handle multiple types of input correctly simultaneously might not be ideal.
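
As a rough sketch of how this could be used on ordinary (multibyte) strings: decode one character with mbrtowc() and pass the result to iswalpha(). The helper names use_environment_locale() and first_char_is_letter() are made up for this illustration and are not part of any library; the result for non-ASCII input depends on the locale that is active and on the encoding of the input bytes.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: call once at startup so that LC_CTYPE follows the
   environment (LANG, LC_ALL, ...). Returns 1 on success, 0 on failure. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: returns 1 if the first character of the multibyte
   string mb is a letter in the current locale, 0 otherwise (including
   empty, truncated, or invalid input). */
int first_char_is_letter(const char *mb)
{
    mbstate_t state;
    wchar_t wc;
    size_t n;

    memset(&state, 0, sizeof state);           /* initial conversion state */
    n = mbrtowc(&wc, mb, strlen(mb), &state);  /* decode one character */
    if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
        return 0;                              /* empty, invalid, or incomplete */
    return iswalpha((wint_t)wc) != 0;
}
```

With a UTF-8 locale selected through use_environment_locale() and UTF-8 input, first_char_is_letter("á") would be expected to return 1; without the setlocale() call the program stays in the C locale, where only the ASCII results are guaranteed.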

score 3 · Accepted Answer · answered Apr 09 '11 at 13:15

If you are working with single-byte codesets such as ISO 8859-1 or 8859-15 (or any of the other 8859-x codesets), then the isalpha() function will do the job if you also remember to use setlocale(LC_ALL, ""); (or some other suitable invocation of setlocale()) in your program. Without this, the program runs in the C locale, which only classifies the ASCII characters (8859-x characters in the range 0x00..0x7F).

If you are working with multibyte or wide character codesets (such as UTF8 or UTF16), then you need to look to the wide character functions found in <wchar.h> and <wctype.h>.
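
A minimal sketch of the single-byte case, under some assumptions: the helper names are invented for this example, and the locale name "en_US.ISO-8859-1" is only a guess at what a given system calls its Latin-1 locale (names are system-specific, and the locale may not be installed). The cast to unsigned char matters because passing a negative char value to isalpha() is undefined behaviour.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: try to select an ISO 8859-1 locale for character
   classification. The locale name is an assumption; if it is unavailable,
   fall back to whatever the environment specifies.
   Returns 1 if some locale was set, 0 otherwise. */
int select_latin1_locale(void)
{
    if (setlocale(LC_CTYPE, "en_US.ISO-8859-1") != NULL)
        return 1;
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Returns 1 if c is a letter in the current locale, 0 otherwise. */
int is_letter(char c)
{
    /* Convert to unsigned char first: bytes >= 0x80 may be negative
       when plain char is signed. */
    return isalpha((unsigned char)c) != 0;
}
```

In ISO 8859-1, á is the single byte 0xE1, so it can be written as '\xE1' regardless of the source file's encoding. Once a Latin-1 locale is active, is_letter('\xE1') would be expected to return 1; in the default C locale it returns 0.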

Jonathan Leffler

ISO 8859-1 would be perfect. According to you, this is single-byte, so do I need to declare chars like 'ç' or 'á' as wchar_t, or can I use char? And how do I use setlocale() to select ISO 8859-1? Also, if I can use char, how can I, for example, declare a char variable that will contain 'á'? I'm sorry for asking so many questions, but I'm very unfamiliar with this topic. – jmacedo Apr 09 '11 at 13:52
When I do this: char o = 'ç'; gcc tells me: main.c:9:11: warning: multi-character character constant main.c: In function ‘main’: main.c:9: warning: overflow in implicit constant conversion – jmacedo Apr 09 '11 at 14:44
@joxnas: what environment settings have you got for LANG, LC_ALL, LC_CTYPE? What sort of terminal emulator are you using? What is its codeset? The compiler warnings make it sound as if you have a UTF-8 terminal emulator - not at all uncommon these days. And a lot depends on the C library you are using. – Jonathan Leffler Apr 09 '11 at 17:28
The locale command gives me these: LANG=en_US.utf8; LC_CTYPE="en_US.utf8"; LC_ALL=""; the terminal emulator is GNOME Terminal 2.30.2. There's an option in the menu bar to set the encoding to ISO 8859-1, and in gvim (the editor I use) there's the :set fileencoding command. I tried setting fileencoding to latin1 (ISO 8859-1) in vim, and `char o = 'ç';` now works. However, to correctly read accented characters from input (getchar) into a 1-byte char, I also needed to set the encoding in the terminal. Does setlocale() allow me to control this inside my program? – jmacedo Apr 09 '11 at 18:35
@joxnas: No, `setlocale()` does not control the terminal attributes. There is most probably a way to do it programmatically, but it will require some manual bashing - quite extensive manual bashing, in all likelihood. – Jonathan Leffler Apr 10 '11 at 03:43

score 1 · Answer 3 · Ben Stott · 2011-04-09T12:36:37.217

How these characters are stored is locale-dependent. On most UNIX systems, they'll be stored as UTF8, whereas a Win32 machine will likely represent them as UTF16. UTF8 stores each character as a variable number of chars, whereas UTF16 stores it in 16-bit units, using surrogate pairs where needed, and thus inside a wchar_t (or unsigned short). Incidentally, sizeof(wchar_t) on Windows is only 2 (vs 4 on *nix), so you'll often need 2 wchar_t values to store 1 character if a surrogate pair encoding is used, which it will be in many cases.
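
To make the storage difference concrete, here is a small sketch that encodes a Unicode code point as UTF-8 and reports how many 16-bit units UTF-16 would need for it. The function names utf8_encode() and utf16_units() are invented for this illustration; they are not standard library functions.

```c
#include <stddef.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out (at most
   4 bytes). Returns the number of bytes written, or 0 if cp is not a
   valid Unicode scalar value. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* e.g. accented Latin: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* rest of the BMP: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* supplementary planes: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: number of 16-bit code units UTF-16 needs for cp.
   1 inside the Basic Multilingual Plane, 2 (a surrogate pair) above it,
   0 for invalid values. */
size_t utf16_units(unsigned long cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return cp < 0x10000 ? 1 : 2;
}
```

For á (U+00E1), utf8_encode() produces the two bytes 0xC3 0xA1, so the character does not fit in a single char when the text is UTF-8, while utf16_units() reports a single 16-bit unit. A surrogate pair is only needed for code points above U+FFFF.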

As was mentioned, the iswalpha() routine will do this for you, and is documented here. It should take care of locale-specific issues for you.

edited Apr 09 '11 at 12:36

answered Apr 09 '11 at 12:15

Ben Stott

No that is not correct, UTF16 is also variable length it is just that each code point is 16 bits instead of 8 as in UTF8. – AndersK Apr 09 '11 at 12:33
Yes, however you will only ever need one wchar_t type per character. Oh, actually, unless you're on a Windows machine which stores them as 16 bits instead of 32. Yep, you're correct. – Ben Stott Apr 09 '11 at 12:35
Nit: You'll rarely need two wchar_t, not often. – ikegami Apr 09 '11 at 12:43

score 1 · Answer 4 · answered Apr 09 '11 at 12:43

You probably want http://site.icu-project.org/. It provides a portable library with APIs for this.

bmargulies
