3

I'm trying to run a regex search in Java, but I'm having trouble when the text being searched is Turkish. For example:

Search text = "Ahmet Yıldırım" or "Esin AYDEMİR"

// The search term comes from the local part of an e-mail address (e.g. yildirim@example.com), and I'm trying to find it in the name.
Regex strings = "yildirim" or "aydemir"

The text being searched changes dynamically, so how can I solve this with a Java regex pattern? Alternatively, how do I convert Turkish characters to their ASCII counterparts (e.g. AYDEMİR -> AYDEMIR, or Yıldırım -> Yildirim)?

Sorry about my grammar mistakes!

nhahtdh
Junior Develepor
  • OK, but how do I convert "yildirim" to "y[iı]ld[ıi]r[ıi]m" dynamically? How can I detect the characters "İÖÜŞÇĞıöüşçğ" in any text and convert them to "IOUSCGiouscg"? – Junior Develepor Aug 20 '15 at 12:21

4 Answers

10

Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags:

Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

Demo on ideone

By default, Pattern.CASE_INSENSITIVE matches case-insensitively only for characters in the US-ASCII character set. Pattern.UNICODE_CASE extends this so that case-insensitive matching applies to all Unicode characters.

Do note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive manner. Therefore ı, i, I and İ are all treated as the same character and match one another.
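A minimal sketch of the two flags in action, using the names from the question. The class and method names here are made up for illustration, and the Turkish letters are written as Unicode escapes so the snippet does not depend on the source file encoding:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the Pattern API calls come from the JDK.
class TurkishRegexDemo {

    // True if 'term' is found in 'text' with Unicode-aware case-insensitive matching.
    static boolean findUnicodeCase(String term, String text) {
        Pattern p = Pattern.compile(term, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(text).find();
    }

    // Same search with CASE_INSENSITIVE only, i.e. US-ASCII case folding.
    static boolean findAsciiCase(String term, String text) {
        Pattern p = Pattern.compile(term, Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String yildirim = "Ahmet Y\u0131ld\u0131r\u0131m"; // dotless i is U+0131
        String aydemir = "Esin AYDEM\u0130R";              // dotted capital I is U+0130

        System.out.println(findUnicodeCase("yildirim", yildirim));
        System.out.println(findUnicodeCase("aydemir", aydemir));
        System.out.println(findAsciiCase("yildirim", yildirim));
    }
}
```

With both flags set, the plain ASCII search terms are found in the Turkish names. With CASE_INSENSITIVE alone, the dotless ı is not folded, so the search fails.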

Depending on your use case, you might want to use Pattern.LITERAL to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
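Since the search term comes from an e-mail address, it may contain characters such as "." that are regex metacharacters. The following sketch shows both escaping options; the class name, method names and the sample term "a.yildirim" are assumptions made for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing Pattern.LITERAL and Pattern.quote().
class LiteralSearchDemo {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Treats the whole term as literal text via the LITERAL flag.
    static boolean findWithLiteralFlag(String term, String text) {
        return Pattern.compile(term, FLAGS | Pattern.LITERAL).matcher(text).find();
    }

    // Escapes the term with Pattern.quote(), so it could be embedded in a larger pattern.
    static boolean findWithQuote(String term, String text) {
        return Pattern.compile(Pattern.quote(term), FLAGS).matcher(text).find();
    }

    // No escaping: '.' in the term matches any character.
    static boolean findUnescaped(String term, String text) {
        return Pattern.compile(term, FLAGS).matcher(text).find();
    }

    public static void main(String[] args) {
        String term = "a.yildirim";
        System.out.println(findWithLiteralFlag(term, "A.Y\u0131ld\u0131r\u0131m"));
        System.out.println(findWithQuote(term, "A.Y\u0131ld\u0131r\u0131m"));
        System.out.println(findUnescaped(term, "AxYildirim"));
        System.out.println(findWithQuote(term, "AxYildirim"));
    }
}
```

Both escaping variants keep the case-insensitive behaviour and differ only in how the term is protected. Pattern.quote() is the one to reach for when the term is just one part of a larger expression.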

nhahtdh
  • Do you know of a way to achieve the same behavior in .NET? Just curious. – Wiktor Stribiżew Aug 20 '15 at 12:47
  • @stribizhev: That's a good question. I thought `IgnoreCase | CultureInvariant` would work, but it turns out that it doesn't. You might want to ask a new question? (I also want to know the answer.) – nhahtdh Aug 20 '15 at 13:00
  • @stribizhev: Actually, in .NET, we can solve this problem by setting the appropriate culture (in this case, Turkish) - though it means that you must at least know the language of the input before you process it. – nhahtdh Aug 20 '15 at 13:07
  • I also checked the `CultureInvariant` flag at first. I will study this when I have some time. – Wiktor Stribiżew Aug 20 '15 at 13:18
8

The question in your comment is more complicated than the original one.

You can use

string = Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");

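to convert "İÖÜŞÇĞıöüşçğ" to "IOUSCGıouscg". NFD splits each accented letter into a base letter plus combining marks, and \p{Mn} removes those marks; the dotless ı has no decomposition, so it is left unchanged. The result is already sufficient for a case-insensitive match, as pointed out by nhahtdh. If you want to perform a case-sensitive match, you have to add a .replace('ı', 'i') so that ı matches i.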
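Put together, the conversion might look like the sketch below. The class and method names are hypothetical, and the Turkish letters are written as Unicode escapes to keep the source ASCII-only:

```java
import java.text.Normalizer;

// Hypothetical helper class; Normalizer and String methods are standard JDK API.
class TurkishAsciiFolding {

    // Decomposes accented letters (NFD) and drops the combining marks (\p{Mn}).
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
    }

    // Additionally maps dotless i (U+0131), which has no decomposition, to a plain 'i'.
    static String toAscii(String s) {
        return stripDiacritics(s).replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Y\u0131ld\u0131r\u0131m")); // input is "Yildirim" written with dotless i
        System.out.println(toAscii("AYDEM\u0130R"));            // input has dotted capital I (U+0130)
    }
}
```

After this step, an ordinary case-sensitive search for "Yildirim" or "AYDEMIR" works on the converted text.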

Community
Holger
0

I am using this pattern.

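// Returns true if text consists only of letters, digits and whitespace.
// \p{L} already covers Turkish letters such as ı, and matches() anchors
// the pattern itself, so no extra characters or ^...$ are needed.
public static boolean isAlphaNumericWithWhiteSpace(String text) {
    return text != null && text.matches("[\\p{L}\\p{N}\\s]*");
}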

\p{L} matches a single code point in the category "letter".

\p{N} matches any kind of numeric character in any script.
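As a quick illustration of these two categories, the sketch below checks a few single characters and a whole string. The class name and the sample inputs are chosen for illustration only:

```java
// Hypothetical demo class for the \p{L} and \p{N} Unicode categories.
class UnicodeCategoryDemo {

    // True if s is exactly one character in the "letter" category.
    static boolean isSingleLetter(String s) {
        return s.matches("\\p{L}");
    }

    // True if s is exactly one character in the "number" category.
    static boolean isSingleNumber(String s) {
        return s.matches("\\p{N}");
    }

    // True if s contains only letters, numbers and whitespace.
    static boolean isAlphaNumericWithWhiteSpace(String s) {
        return s != null && s.matches("[\\p{L}\\p{N}\\s]*");
    }

    public static void main(String[] args) {
        System.out.println(isSingleLetter("\u0131")); // dotless i
        System.out.println(isSingleNumber("\u0663")); // Arabic-Indic digit three
        System.out.println(isAlphaNumericWithWhiteSpace("Ahmet Y\u0131ld\u0131r\u0131m 42"));
        System.out.println(isAlphaNumericWithWhiteSpace("yildirim@example.com"));
    }
}
```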

egemen
-1

GitHub gist for replacing Turkish characters: https://gist.github.com/onuryilmaz/6034569

In Java, string.matches(".*[İÖÜŞÇĞıöüşçğ].*") checks whether the string contains Turkish characters.
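A minimal sketch of such a containment check, written with Matcher.find() so that no leading or trailing wildcard is needed. The class and method names are hypothetical, and the character class lists the same twelve letters as Unicode escapes:

```java
import java.util.regex.Pattern;

// Hypothetical helper class for detecting Turkish-specific letters.
class TurkishLetterCheck {

    // The letters named in the thread, upper case first, then lower case,
    // written as escapes for U+0130 through U+011F.
    private static final Pattern TURKISH_LETTER = Pattern.compile(
            "[\u0130\u00D6\u00DC\u015E\u00C7\u011E\u0131\u00F6\u00FC\u015F\u00E7\u011F]");

    // True if at least one Turkish-specific letter occurs anywhere in s.
    static boolean containsTurkishLetter(String s) {
        return s != null && TURKISH_LETTER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTurkishLetter("Esin AYDEM\u0130R"));
        System.out.println(containsTurkishLetter("Esin AYDEMIR"));
    }
}
```

Because find() looks for the character class anywhere in the input, the check also works for multi-line strings, where "." would stop at a line terminator.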

Robert
  • Can you edit your answer by surrounding the code snippet in back-ticks? That would improve readability. – Jose Aug 27 '18 at 16:38