Is it possible to call setlocale(LC_CTYPE, "ru_RU.utf8"), run an isalpha() check on each symbol of the string "рус eng", and get the following result:

р alpha
у alpha
с alpha
  not alpha
e not alpha
n not alpha
g not alpha

Right now, when I set the locale to ru_RU.utf8, every symbol except the space is reported as alpha.

dmigous

1 Answer

The isalpha function asks the question:

The isalpha() function shall test whether c is a character of class alpha in the program's current locale.

and goes on to note:

The c argument is an int, the value of which the application shall ensure is representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.

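In a UTF-8 locale this means that it is only useful for ASCII characters: a Cyrillic letter is encoded as two bytes, so it cannot be passed to isalpha() as a single unsigned char value.

In practice the test amounts to "is the character in the ranges [A-Z] or [a-z]", nothing more.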
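As a rough illustration of that byte-level behaviour, here is a minimal sketch that runs isalpha() over every byte of a string. The helper name count_alpha_bytes is made up for this example. The cast to unsigned char is there to satisfy the requirement quoted above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes of s that isalpha() accepts
   in the current locale. */
size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Cast so the argument is always representable as unsigned char. */
        if (isalpha((unsigned char)*s))
            ++n;
    }
    return n;
}
```

In the default "C" locale, calling this on the UTF-8 encoding of "рус eng" counts only the three Latin letters, because the bytes that make up each Cyrillic letter are not characters on their own.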

Now, if you want to test characters outside of this range, then you need to use one of the wide character variants such as iswalpha.

It looks like what you are really asking for is a test that rejects every character that is not explicitly a Cyrillic letter. That is not going to work with iswalpha(), because it treats alphabetic characters from pretty much every character set as alpha. The ru_RU locale definition (glibc source, localedata/locales/ru_RU) uses the i18n file as its data source for character types, and that file determines what is considered alpha.

If the input data is truly only from the Russian alphabet, then you can check whether the character is non-ASCII and, if so, accept it as a valid character. Unfortunately, there is a good chance that some typed characters, e.g. е (CYRILLIC SMALL LETTER IE, Unicode: U+0435, UTF-8: D0 B5), will be entered using the Latin character e (LATIN SMALL LETTER E, Unicode: U+0065, UTF-8: 65) and so would be missed by this test.
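A minimal sketch of that non-ASCII check might look like the following. The helper name is_non_ascii_alpha is hypothetical, and the sketch assumes the caller has already selected a suitable locale with setlocale(LC_CTYPE, ...), such as the ru_RU.utf8 locale from the question.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: accepts a wide character only if it lies outside
   the ASCII range and the current locale classifies it as alphabetic. */
bool is_non_ascii_alpha(wchar_t c)
{
    return c > 0x7F && iswalpha((wint_t)c) != 0;
}
```

With this check, the Latin e is rejected regardless of the locale, which is exactly the look-alike problem described above.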

If you want to test for Cyrillic characters explicitly, then you need to test for the following character ranges (uppercase letters first, then lowercase):

% CYRILLIC/ 
   <U0400>..<U042F>;<U0460>..(2)..<U047E>;/ 
   <U0480>;<U048A>..(2)..<U04BE>;<U04C0>;<U04C1>..(2)..<U04CD>;/ 
   <U04D0>..(2)..<U04FE>;/ 
% CYRILLIC SUPPLEMENT/ 
   <U0500>..(2)..<U0522>;/ 
% CYRILLIC SUPPLEMENT 2/ 
   <UA640>..(2)..<UA65E>;<UA662>..(2)..<UA66C>;<UA680>..(2)..<UA696>;/ 
% CYRILLIC/ 
   <U0430>..<U045F>;<U0461>..(2)..<U047F>;/ 
   <U0481>;<U048B>..(2)..<U04BF>;<U04C2>..(2)..<U04CE>;/ 
   <U04CF>;/ 
   <U04D1>..(2)..<U0523>;/ 
% CYRILLIC SUPPLEMENT 2/ 
   <UA641>..(2)..<UA65F>;<UA663>..(2)..<UA66D>;<UA681>..(2)..<UA697>;/ 
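
One way to turn the listing above into a test is sketched below. It merges the uppercase and lowercase entries into five inclusive ranges and works on a Unicode code point rather than a wchar_t, so it does not depend on the locale. The names cyrillic_ranges and is_cyrillic_letter are invented for this example; on glibc a wchar_t holds the code point, so a wide character can be passed directly there.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point ranges obtained by merging the uppercase and
   lowercase entries of the listing (e.g. the even and odd stepped
   ranges 0460..047E and 0461..047F together cover 0460..047F). */
static const struct { uint32_t first, last; } cyrillic_ranges[] = {
    { 0x0400, 0x0481 },
    { 0x048A, 0x0523 },
    { 0xA640, 0xA65F },
    { 0xA662, 0xA66D },
    { 0xA680, 0xA697 },
};

/* Hypothetical helper: true if cp falls in one of the ranges above. */
bool is_cyrillic_letter(uint32_t cp)
{
    size_t count = sizeof cyrillic_ranges / sizeof cyrillic_ranges[0];
    for (size_t i = 0; i < count; ++i) {
        if (cp >= cyrillic_ranges[i].first && cp <= cyrillic_ranges[i].last)
            return true;
    }
    return false;
}
```

For the string in the question, is_cyrillic_letter(0x0440) (р) is true, while is_cyrillic_letter(0x65) (Latin e) and is_cyrillic_letter(0x20) (space) are false.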
Anya Shenanigans
  • Mmm, yes, you are right. I forgot to mention iswalpha() for wchar_t; that is what I use. – dmigous May 08 '13 at 09:27
  • You should use `isspace`/`iswspace` in that case as an additional test – Anya Shenanigans May 08 '13 at 09:29
  • Will iswspace return true for the input symbol 'e'? I need to filter out everything other than the alphabetical symbols of the current locale – dmigous May 08 '13 at 09:33
  • No, iswspace only deals with whitespace characters. You should combine the conditions with the logical or (`||`) operator, e.g. `iswspace(c) || iswalpha(c)` tests whether the character is either a space or an alpha character – Anya Shenanigans May 08 '13 at 09:44
  • But I need the following: iswalpha(L'e') => false, iswalpha(L'ю') => true. From your last comments I can't understand how this relates to the issue – dmigous May 08 '13 at 10:10
  • That's not a test for alpha characters. That looks more like a test for Cyrillic-specific characters. I'll update the answer – Anya Shenanigans May 08 '13 at 11:08