First, I know that matching Chinese Unicode characters should use

[\x{4e00}-\x{9fa5}]

Then I use a group and a backreference

([\x{4e00}-\x{9fa5}])\1

But this only matches adjacent repeats, like "中中".

I need all the characters that appear more than once anywhere in the text. For example:

中国保持中立
^      ^

PS. I use the TextMate editor.

Any help? TIA!

Maadiah
  • Don't know about textmate, but will `([\x{4e00}-\x{9fa5}]).*\1` help? – Passerby Feb 27 '13 at 07:49
  • It does not work as expected : – Maadiah Feb 27 '13 at 07:56
  • @Maadiah, what did you expect? – stema Feb 27 '13 at 08:12
  • @Maadiah Huh, it works in JS (in form `([\u4e00-\u9fa5]).*\1`). Also sorry for the edit, as SO's editor seems a little misleading. – Passerby Feb 27 '13 at 08:18
  • Sorry for my bad English. There are too many results when I use `([\x{4e00}-\x{9fa5}]).*\1` [Picture](https://www.dropbox.com/s/30colhf0drgpg2u/%E5%B1%8F%E5%B9%95%E5%BF%AB%E7%85%A7%202013-02-27%20%E4%B8%8B%E5%8D%884.15.06.png) – Maadiah Feb 27 '13 at 08:20
  • @Maadiah, why are there too many results? All your matches start and end with the same character, so it found correctly characters that occur more than once. – stema Feb 27 '13 at 08:23
  • @Maadiah The result in the screenshot is exactly what you asked for. – deerchao Feb 27 '13 at 08:30

1 Answer

You can do:

  1. Match everything till the last occurrence of that character

    ([\x{4e00}-\x{9fa5}]).*\1
    

    See it here on Regexr

  2. Match everything till the next occurrence of that character

    ([\x{4e00}-\x{9fa5}]).*?\1
    

    See it here on Regexr

  3. If you only want to match a character that also occurs later in the text, without matching everything in between, and lookaheads are supported

    ([\x{4e00}-\x{9fa5}])(?=.*\1)
    

    See it here on Regexr

    This will not match the last occurrence, because the character no longer appears later in the text.
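
To see how the three patterns differ, here is a minimal sketch in Python. It rests on several assumptions: Python's `re` module stands in for TextMate's Oniguruma engine, the range is written as `\u4e00-\u9fa5` because `re` does not accept `\x{4e00}`, and the sample string is made up for illustration (it is the question's example with a few characters appended so that the greedy and lazy variants give different results).

```python
import re

# Hypothetical sample text: "中" occurs three times, "国" twice.
text = "中国保持中立的中国"

# Option 1: greedy, runs from a character to its LAST later occurrence.
greedy = re.compile(r"([\u4e00-\u9fa5]).*\1")
# Option 2: lazy, runs from a character to its NEXT occurrence.
lazy = re.compile(r"([\u4e00-\u9fa5]).*?\1")
# Option 3: lookahead, matches only the character itself,
# provided the same character appears again later on the line.
lookahead = re.compile(r"([\u4e00-\u9fa5])(?=.*\1)")

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

# Every occurrence that has a later duplicate (the last one is skipped).
repeated_hits = [m.group(1) for m in lookahead.finditer(text)]
# Distinct repeated characters, in order of first appearance.
repeated_chars = list(dict.fromkeys(repeated_hits))

print(greedy_match)    # 中国保持中立的中
print(lazy_match)      # 中国保持中
print(repeated_hits)   # ['中', '国', '中']
print(repeated_chars)  # ['中', '国']
```

Note that `.` does not cross line breaks by default in Python, so repeats are only detected within a single line unless `re.DOTALL` is passed; whether the same holds in TextMate depends on its own settings.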

stema
  • (Your answer is OK, just some side comment). According to [this](http://manual.macromates.com/en/regular_expressions), TextMate uses Oniguruma regular expression library, while RegExr uses Flex 3 flavor. It happens that the features are overlapping, especially `\x{hhhh}`, since the documentation for [Flex 3 regex](http://livedocs.adobe.com/flex/3/html/help.html?content=12_Using_Regular_Expressions_03.html) doesn't say anything about it, and `\uhhhh` is supposed to work but doesn't work in Flex 3. – nhahtdh Feb 27 '13 at 09:04
  • @nhahtdh, I like Regexr for the user interface, but the behaviour is a bit inconsistent. It seems to support more than it should. e.g. I know that Regexr is supporting fixed length **lookbehind**, but in the Flex3 documentation only lookahead is mentioned. – stema Feb 27 '13 at 09:25