4

This was a fascinating debugging experience. Can you spot the difference between the following two lines?

StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]

They do very different things when you evaluate them. It turns out that the string being replaced in the first line consists of a Unicode en dash (U+2013), whereas the second line uses a plain old ASCII hyphen.
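A quick way to confirm which dash a string contains is to look at its character codes. This is a small sketch using only built-in functions; the outputs in the comments follow from the code points U+2013 and U+002D:

```mathematica
(* En dash vs. ASCII hyphen, as character codes *)
ToCharacterCode /@ {"–", "-"}
(* {{8211}, {45}} *)

(* The en dash can also be typed with a 4-digit hex escape *)
"\:2013" === "–"
(* True *)
```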

In the case of the Unicode string, the regular expression doesn't match. I meant the regex "[\s\S]" to mean "match any character (including newline)", but Mathematica apparently treats it as "match any ASCII character".

How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?

PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.

tchrist
dreeves
    I don't know why this old question popped up as active, but the issue seems to have been resolved by version 10; both work now. The Unicode en dash is keyed as "\:2013" in Mathematica, btw. – agentp Mar 16 '17 at 16:08

3 Answers

3

Here's an asciify function which I used as a workaround at first:

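(* Strings (already-escaped characters) pass through unchanged *)
f[s_String] := s
(* A character code is turned back into a one-character string *)
f[x_] := FromCharacterCode[x]

(* Replace each character with code above 255 by "&code;"; codes up to 255 are kept as-is *)
asciify[s_String] := 
  StringJoin[f /@ (ToCharacterCode[s] /. x_?(#>255&) :> "&"<>ToString[x]<>";")]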

Then I realized, thanks to @Isaac's answer, that "." as a regular expression doesn't seem to have this Unicode problem. I learned from the answers to "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is ill-advised but that "(?s)." is recommended. So I think the best fix is the following:

StringReplace["–", RegularExpression@"(?s)." -> "abc"]
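As a quick check, assuming a version in which "(?s)." behaves as described above, the same pattern can be tried on a string that mixes an ASCII letter, a newline, and the en dash. Each of the three characters should be replaced:

```mathematica
(* "a", newline, en dash: three characters in total *)
s = "a\n\:2013";
StringLength[s]
(* 3 *)

(* (?s) turns on dot-all mode, so "." also matches the newline *)
StringReplace[s, RegularExpression["(?s)."] -> "x"]
(* expected: "xxx" *)
```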
dreeves
    Interesting reading in the other question/answers you cite. Given what's there, I'm inclined to agree that `"(?s)."` is probably better, though as I read those answers, the issue may be limited to `"(.|\n)*"` (with the `*`). – Isaac Mar 25 '10 at 05:15
3

I would use a StringExpression in place of RegularExpression. This works as desired:

f[s_String] := StringReplace[s, _ -> "abc"]

In a StringExpression, Blank[] will match anything, including non-ASCII characters.
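For example, with the definition of f above, both strings from the question should give the same result (a small check, not run against every version):

```mathematica
f[s_String] := StringReplace[s, _ -> "abc"]

(* En dash and ASCII hyphen are both single characters, so each matches _ *)
f["–"] === f["-"]
(* expected: True, since both evaluate to "abc" *)
```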

EDIT in response to version updates: as of Mathematica 11.0.1, it looks like, for characters with character codes up to 2^16 - 1 (which is called out as the maximum value for FromCharacterCode), the results of StringMatchQ[#, LetterCharacter] now match those of LetterQ.

AllTrue[FromCharacterCode /@ Range[2^16 - 1], 
 LetterQ@# === StringMatchQ[#, LetterCharacter] &]
(* True *)
Pillsy
    As noted in the Working with String Patterns tutorial [1] under "RegularExpression versus StringExpression", the string pattern `_` and `RegularExpression["(?s)."]` are equivalent. [1] http://reference.wolfram.com/mathematica/tutorial/WorkingWithStringPatterns.html – Michael Pilat Mar 26 '10 at 20:12
1

Using "(.|\n)" for the input to RegularExpression seems to work for me. The pattern matches . (any non-newline character) or \n (a newline character).

Isaac