3

I am trying to perform case-insensitive matching with the Pattern and Matcher classes in Java, for Russian text. Below is the text:

"some text газированных напитков some other text"

Below is the Pattern I am using to match the text:

Pattern pattern = Pattern.compile("(?iu)\\b(" + Pattern.quote("напитки") + ")\\b", Pattern.UNICODE_CHARACTER_CLASS);

I am expecting the following to return true as it's a case insensitive comparison (напитки vs напитков):

System.out.println(pattern.matcher("some text газированных напитков some other text").find());

But it always returns false. I have tried other Pattern constants (like CASE_INSENSITIVE, UNICODE_CASE, CANON_EQ), but it still returns false.

Is there any way in Java to perform such comparison? Is it even possible at all?

Darshan Mehta
  • `\\b` at end might be cause of `false` since there is a character after `напитки`. Since your regex is `\bнапитки\b` there is no match for part of `напитков` – Rahul May 02 '17 at 12:01
  • Wait, your text contains no `напитки` (Russian for "beverages", Plural, Nominative case). `напитков` is the same word in Plural, Genitive case. Do you mean you want to match any grammatical case with the noun in the Nominative case? – Wiktor Stribiżew May 02 '17 at 12:02
  • @WiktorStribiżew It contains `напитков` which is case insensitive version of `напитки` I believe. At least [google translate](https://translate.google.com/#auto/en/%D0%BD%D0%B0%D0%BF%D0%B8%D1%82%D0%BA%D0%BE%D0%B2%20%0A%D0%BD%D0%B0%D0%BF%D0%B8%D1%82%D0%BA%D0%B8) says so :) – Darshan Mehta May 02 '17 at 12:04
  • @DarshanMehta: Don't you see different endings? `и` != `ов`. – Wiktor Stribiżew May 02 '17 at 12:04
  • @WiktorStribiżew Yes, I can see different endings, I believe they are because of the way case is interpreted in the Russian language? Yes, I want to match any grammatical case if that's possible (at all). – Darshan Mehta May 02 '17 at 12:06
  • No, it is impossible with regex and you are wrong about the case sensitivity with Cyrillic symbols. You should normalize the words in the sentence (with some NLP package) and then search for the word in the Nominative case there. – Wiktor Stribiżew May 02 '17 at 12:07
  • @WiktorStribiżew Ah okay, if it's not possible with plain regex, that's fine. That's all I wanted to know. Not sure why it got downvoted though. – Darshan Mehta May 02 '17 at 12:11
  • Hmm, as far as I know from Russian (my native language), case insensitive means in upper or lower case. @DarshanMehta - a different ending in "имя существительное" (a noun) means a different "падеж", or grammatical case in English. – Vitaliy May 02 '17 at 13:31
  • I think the source of confusion here might be the word “case.” It can refer to the case of characters—lowercase, uppercase, title case—but it can also refer to the declension of a noun, such as [accusative case](https://en.wikipedia.org/wiki/Accusative_case). However, “case insensitive” always refers to the case of characters. Regular expressions cannot match different noun declensions. – VGR May 02 '17 at 14:19

2 Answers

11

Just add this option in your Pattern:

Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE

This has worked in all my cases for Cyrillic, and I use it extensively.
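
To make the effect of these flags concrete, here is a minimal sketch. The class name `CyrillicCaseDemo`, the helper `containsWord`, and the sample strings are made up for illustration. `Pattern.UNICODE_CHARACTER_CLASS` is added on the assumption that `\b` should treat Cyrillic letters as word characters on every JDK version. Note that the flags only fold letter case (Н vs н); they do not touch word endings, so `напитков` from the question still does not match `напитки`.

```java
import java.util.regex.Pattern;

class CyrillicCaseDemo {
    // Hypothetical helper: whole-word search that ignores letter case only.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE
                        | Pattern.UNICODE_CASE
                        | Pattern.UNICODE_CHARACTER_CLASS);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Same word in upper case: letter case is folded, so this matches.
        System.out.println(containsWord(
                "some text ГАЗИРОВАННЫЕ НАПИТКИ some other text", "напитки")); // true
        // Different grammatical case (genitive plural): the ending differs, no match.
        System.out.println(containsWord(
                "some text газированных напитков some other text", "напитки")); // false
    }
}
```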

Vitaliy
  • 489
  • 6
  • 20
0

This will work properly:

Pattern pattern = Pattern.compile("(?iu)\\b(" + Pattern.quote("напитк") + ")\\b");
System.out.println(pattern.matcher("some text газированных \"напитк\"ов some other text").find());
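
If the original text has to stay unmodified, one regex-only approximation is to match a hand-picked stem followed by any run of letters. This is a rough sketch under stated assumptions, not real morphology: the stem `напитк`, the class name `StemMatchDemo`, and the helper `containsStem` are chosen for illustration. It will also match unrelated words that happen to share the stem, and it misses forms where the stem itself changes (for example `напиток`), which is why the comments above suggest an NLP package for proper normalization.

```java
import java.util.regex.Pattern;

class StemMatchDemo {
    // Hypothetical helper: matches the stem followed by zero or more letters.
    static boolean containsStem(String text, String stem) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(stem) + "\\p{L}*\\b",
                Pattern.CASE_INSENSITIVE
                        | Pattern.UNICODE_CASE
                        | Pattern.UNICODE_CHARACTER_CLASS);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Genitive plural from the question: stem + "ов".
        System.out.println(containsStem(
                "some text газированных напитков some other text", "напитк")); // true
        // Nominative plural in upper case: stem + "И", letter case folded.
        System.out.println(containsStem("газированные НАПИТКИ", "напитк"));    // true
        // Nominative singular: the stem changes (напиток), so this is missed.
        System.out.println(containsStem("один напиток", "напитк"));            // false
    }
}
```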