I have a text that contains both Cyrillic and Latin characters and I'm trying to determine the ratio of Cyrillic to Latin words. I tried using the Unicode package but couldn't find anything there for counting the different types of words. Is there a way to get a word count or something similar with R that differentiates Cyrillic and Latin words within one text? The text is UTF-8.

Matthieu Brucher
  • Probably not the most efficient but you could use `grep()` to search for all instances of Latin and Cyrillic characters. – CephBirk Dec 07 '16 at 03:49

1 Answer


Here is a reproducible example, since one was not provided:

texmix <- "Лорем ипсум долор сит амет, ин лаборе глориатур дуо, видиссе аццусамус не мел.
 Оцурререт репрехендунт вих ат, вел ин цонвенире волуптатум.
 Иллуд дицит нолуиссе при цу, вих ех диам дебет.
 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
 Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
 Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
 Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

I searched for "Cyrillic text sample" and copied text from the first search result, along with Lorem ipsum. I think the Cyrillic part might be Lorem ipsum written in Cyrillic characters. Feel free to provide a more accurate reproducible example.


You can count "words", that is, contiguous runs of letters from the appropriate script, to get a rough answer. It is rough because I didn't fully deal with hyphenated words, contractions, and other such edge cases, though see the section labeled "Edge Cases" below. I'm not sure which edge cases need to be covered in Cyrillic text, so I leave that to the reader:

library(stringi)
## count of cyrillic "words"
stri_count_regex(texmix, "[\\p{Letter}&&\\p{script=cyrillic}]+")
# [1] 30
## count of latin "words"
stri_count_regex(texmix, "[\\p{Letter}&&\\p{script=latin}]+")
# [1] 69

## ratio
stri_count_regex(texmix, "[\\p{Letter}&&\\p{script=cyrillic}]+") /
stri_count_regex(texmix, "[\\p{Letter}&&\\p{script=latin}]+")
# [1] 0.4347826
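
The ratio above evaluates each count twice. As a minimal sketch, the counts and the ratio can be wrapped in one function. The name script_word_counts is a hypothetical helper, not part of stringi, and the sample string is written with \u escapes only so that the example does not depend on the encoding of the source file:

```r
library(stringi)

# Hypothetical helper (not part of stringi): counts runs of letters for each
# script and returns both counts together with their ratio.
script_word_counts <- function(text) {
  cyr <- sum(stri_count_regex(text, "[\\p{Letter}&&\\p{script=cyrillic}]+"),
             na.rm = TRUE)
  lat <- sum(stri_count_regex(text, "[\\p{Letter}&&\\p{script=latin}]+"),
             na.rm = TRUE)
  # Avoid dividing by zero when the text contains no Latin words.
  ratio <- if (lat > 0) cyr / lat else NA_real_
  list(cyrillic = cyr, latin = lat, ratio = ratio)
}

# "Привет мир, hello big world", with the Cyrillic part given as \u escapes.
sample_text <- "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440, hello big world"
res <- script_word_counts(sample_text)
res$cyrillic  # 2
res$latin     # 3
res$ratio     # 0.6666667
```

Summing the counts means the function also accepts a character vector of several texts and reports totals across all of them.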

I took the pattern from the stringi reference manual (under "stringi-search-charclass"):

[\p{Letter}&&\p{script=cyrillic}] Logical AND or intersection – match the set of all Cyrillic letters.

You could also use the less specific stri_count_regex(texmix, "\\p{Cyrillic}+") and stri_count_regex(texmix, "\\p{Latin}+"), which match any character assigned to the script rather than only its letters.


Edge Cases

You can start to address whichever edge cases matter to you, such as hyphenated words or contractions, with an approach like this:

stri_count_regex(texmix, 
    "[\\p{Letter}&&\\p{script=latin}]+[-']?[\\p{Letter}&&\\p{script=latin}]*")

Here, the first run of Latin letters is followed by an optional hyphen or apostrophe ([-']?) and then zero or more Latin letters ([\\p{Letter}&&\\p{script=latin}]*).
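
To see what the optional joiner changes, here is a small comparison on a made-up sentence. The third pattern, multi, is an assumed generalization that is not in the answer above: it allows any number of hyphens or apostrophes, each of which must be followed by at least one letter. Treat it as a sketch, not a complete tokenizer (for example, it does not handle the typographic apostrophe U+2019):

```r
library(stringi)

latin  <- "[\\p{Letter}&&\\p{script=latin}]"
# Plain runs of Latin letters.
simple <- paste0(latin, "+")
# One optional hyphen or apostrophe, as in the pattern above.
single <- paste0(latin, "+[-']?", latin, "*")
# Assumed generalization: any number of joiners, each followed by letters.
multi  <- paste0(latin, "+(?:[-']", latin, "+)*")

x <- "a well-known mother-in-law doesn't mind"
stri_count_regex(x, simple)  # 9: "well-known" and "doesn't" are each split in two
stri_count_regex(x, single)  # 6: "mother-in-law" still counts as "mother-in" + "law"
stri_count_regex(x, multi)   # 5
```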



A similar approach in base R, if you prefer not to use stringi, could be:

lengths(gregexpr("\\p{Cyrillic}+", texmix, perl = TRUE))
# [1] 30
lengths(gregexpr("\\p{Latin}+", texmix, perl = TRUE))
# [1] 69
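
One caveat with the base R version: gregexpr() returns -1 when nothing matches, so lengths(gregexpr(...)) reports 1 rather than 0 for a text that has no words in the given script. A sketch that avoids this by extracting the matches with regmatches() first; count_matches is a hypothetical helper name:

```r
# Hypothetical helper: counts regex matches per string, giving 0 when there
# are none (regmatches() returns character(0) for a non-match).
count_matches <- function(pattern, text) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

latin_only <- "hello world"
# "Привет, world", with the Cyrillic part given as \u escapes so the example
# does not depend on the encoding of the source file.
mixed <- "\u041f\u0440\u0438\u0432\u0435\u0442, world"

lengths(gregexpr("\\p{Cyrillic}+", latin_only, perl = TRUE))  # 1, although nothing matched
count_matches("\\p{Cyrillic}+", latin_only)                   # 0
count_matches("\\p{Cyrillic}+", mixed)                        # 1
count_matches("\\p{Latin}+", mixed)                           # 1
```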

Further potentially helpful info on these Unicode character properties is available here: http://www.regular-expressions.info/unicode.html

Jota