
I would like to know if a string contains Russian/Cyrillic characters.

For Latin characters, I do something like this:

func hasLowerLatin(text string) bool {
    // Ranging over a string yields its runes one by one.
    for _, r := range text {
        if r >= 'a' && r <= 'z' {
            return true
        }
    }
    return false
}

What is the corresponding way to do it for the Russian/Cyrillic alphabet?

Thomas
  • Did you try just using the Unicode charts (I assume your input is Unicode), like this one for example: [link](http://sites.psu.edu/symbolcodes/languages/europe/cyrillic/cyrillicchart/)? Just iterate over whatever values you are interested in. – K. Kirsz Jun 27 '17 at 20:35

2 Answers


This seems to work

unicode.Is(unicode.Cyrillic, r) // r is a rune
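
To answer the original question for a whole string, that check can be wrapped in a loop. This is only a minimal sketch: the function name `containsCyrillic` is made up for illustration, and it assumes "contains" means at least one Cyrillic rune anywhere in the string.

```go
package main

import "unicode"

// containsCyrillic (hypothetical name) reports whether s has at least
// one rune that belongs to the Cyrillic script.
func containsCyrillic(s string) bool {
	// Ranging over a string decodes it rune by rune.
	for _, r := range s {
		if unicode.Is(unicode.Cyrillic, r) {
			return true
		}
	}
	return false
}
```

Because the check is script-based, a Latin "A" and a Cyrillic "А" are told apart even though they look the same.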
Thomas
  • This is the way to go. It catches the full range of Cyrillic characters, including oddballs like ᴫ (U+1D2B) and the huge ranges at U+A640 and U+0460. – Schwern Jun 27 '17 at 21:20

I went ahead and wrote this example implementation, which checks whether a string consists only of Russian uppercase characters, based on this Unicode chart:

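// isRussianUpper reports whether every rune in text lies in
// U+0410..U+042F (А through Я). Note that Ё (U+0401) is outside
// this range, and that an empty string returns true.
func isRussianUpper(text string) bool {
    for _, r := range text {
        if r < '\u0410' || r > '\u042F' {
            return false
        }
    }
    return true
}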

You can handle any set of characters this way. Just change the code point bounds to cover the characters you are interested in.
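
If the set is not one contiguous range, one option is a custom `unicode.RangeTable` passed to `unicode.Is`. The sketch below is an assumption about what "Russian letters" should mean here: the 33-letter modern alphabet in both cases, including Ё/ё. The names `russianLetters` and `isRussianLetter` are made up for illustration.

```go
package main

import "unicode"

// russianLetters (hypothetical name) covers the modern Russian alphabet.
// Ranges must be sorted and non-overlapping.
var russianLetters = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x0401, Hi: 0x0401, Stride: 1}, // Ё
		{Lo: 0x0410, Hi: 0x044F, Stride: 1}, // А..Я and а..я
		{Lo: 0x0451, Hi: 0x0451, Stride: 1}, // ё
	},
}

// isRussianLetter (hypothetical name) reports whether r is one of the
// code points listed in russianLetters.
func isRussianLetter(r rune) bool {
	return unicode.Is(russianLetters, r)
}
```

This excludes Cyrillic letters used only in other languages, such as the Ukrainian і (U+0456), which may or may not be what you want.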

K. Kirsz
  • Thanks. Some letters in the Russian alphabet look like Latin letters (like the o or A), so I thought I would have to do something more complicated. – Thomas Jun 27 '17 at 20:53
  • Note that there are many Cyrillic characters outside of that range. There are more at U+0460 (that range alternates upper and lower case), one at U+1D2B, and more at U+A640. And more can be added. [Avoid hard coding ranges, use `unicode.Is(unicode.Cyrillic, r)` instead](https://stackoverflow.com/a/44789675/14660). To distinguish between upper and lower case, use `unicode.IsUpper` and `unicode.IsLower`, keeping in mind that there are some characters which are *both* and some which are *neither*. – Schwern Jun 27 '17 at 21:19