How would I go about using Regex to match Unicode strings? I'm loading a couple of keywords from a text file and using them with Regex on another file. Both keywords contain Unicode characters (such as á). I'm not sure where the problem is. Is there some option I have to set?


Code:

foreach (string currWord in _keywordList)
{
    MatchCollection mCount = Regex.Matches(
        nSearch.InnerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase);

    if (mCount.Count > 0)
    {
        wordFound.Add(currWord);
        MessageBox.Show(@currWord, mCount.ToString());
    }
}

And reading the keywords to a list:

var rdComp = new StreamReader(opnDiag.FileName);
string compSplit = rdComp.ReadToEnd()
                         .Replace("\r\n", "\n")
                         .Replace("\n\r", "\n");
rdComp.Dispose();
string[] compList = compSplit.Split(new[] {'\n'});

Then I change the array to a list.

tchrist
cam

2 Answers

1

When matching on a specific character, I believe regular expressions only support literals for the ASCII character set. Beyond that, you can use \uxxxx to match on the Unicode code point.

See here.

mbeckish
  • I'm not sure that's the problem. She/he isn't using character classes but verbatim strings, surrounded by word boundaries. – Tim Pietzcker Mar 29 '10 at 14:08
  • @Pietzcker - That's the problem. S/he needs to parse the string and add each character as a unicode code point. – mbeckish Mar 29 '10 at 20:15
  • Well, I've just tried using Unicode literals in a regex in C#, and it worked perfectly. `Console.WriteLine(Regex.Replace("It BӦЯӁڀ!", @"\bBӦЯӁڀ\b", "works"));` returns `It works!` – Tim Pietzcker Mar 29 '10 at 20:56
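
To compare the two forms discussed in this thread, here is a minimal sketch, assuming the .NET `Regex` class from the question. It matches the same word once from a literal keyword and once from a `\uXXXX` escape. The class name, the `CountWholeWord` helper, and the sample text are made up for illustration and are not from the original post. `Regex.Escape` is added only so that a keyword containing regex metacharacters is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeKeywordDemo
{
    // Hypothetical helper: counts whole-word, case-insensitive
    // occurrences of keyword in text.
    public static int CountWholeWord(string text, string keyword)
    {
        // Regex.Escape neutralizes metacharacters such as '.' or '+';
        // letters like á pass through unchanged.
        string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string text = "El árbol es alto. Otro ÁRBOL aquí.";

        // Keyword given as a literal, non-ASCII character included.
        Console.WriteLine(CountWholeWord(text, "árbol"));

        // Same word with á written as a \uXXXX escape inside the pattern.
        Console.WriteLine(Regex.IsMatch(text, @"\b\u00E1rbol\b"));
    }
}
```

If both forms match in a test like this but the real keywords still fail, the keyword strings themselves may not contain the characters you expect, for example if the keyword file was read with a different encoding than it was saved in. That is a guess, not something the question confirms.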
0

You can use `[\u0000-\uffff]+` to match at least the BMP (Basic Multilingual Plane).
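
A rough sketch of what that range does in .NET, where strings are UTF-16 and the regex engine works on 16-bit code units. A character outside the BMP is stored as two surrogate code units, and each of those also falls inside `\u0000-\uffff`, so the class accepts them as well. The class and method names below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BmpRangeDemo
{
    // The range from this answer, anchored so it must cover the whole input.
    public static bool IsAllCodeUnits(string s) =>
        Regex.IsMatch(s, @"^[\u0000-\uffff]+$");

    // \p{Cs} is the Unicode "surrogate" category; a hit means the string
    // holds at least one half of a character outside the BMP.
    public static bool HasSurrogate(string s) =>
        Regex.IsMatch(s, @"\p{Cs}");

    public static void Main()
    {
        string bmpOnly = "\u00E1\u04E6\u042F";  // á, Ӧ, Я: all inside the BMP
        string withAstral = "a\U0001F600";      // 'a' plus U+1F600 as a surrogate pair

        Console.WriteLine(IsAllCodeUnits(bmpOnly));
        Console.WriteLine(IsAllCodeUnits(withAstral));
        Console.WriteLine(HasSurrogate(bmpOnly));
        Console.WriteLine(HasSurrogate(withAstral));
    }
}
```

So the range is a "match any code unit" class rather than a BMP-only filter. If you need to tell the two cases apart, checking for `\p{Cs}` is one option.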

brighty