
I need to match (NOT DELETE) all duplicate words in a text.

For example: Men's·Tee·Shirt·Vintage·T·Shirt·1990·Deep·Black·Red·Text·Deep·Black·Red·Text·X-Small

Deep·Black·Red·Text·Deep·Black·Red·Text are repeating.

None of the regexes I could find work.

Please help!

P.S. Sometimes it is going to be just one word matching, e.g. brown brown, and sometimes a pattern like the one I've mentioned above.

Wiktor Stribiżew
zerina
  • Which regex did you try, and how did it fail? – Wiktor Stribiżew Jun 11 '18 at 08:33
  • In [DataPrep](https://cloud.google.com/dataprep/docs/html/Supported-Special-Regular-Expression-Characters_57344771), the regex seems to be rather "weak", as it does not seem to support lookarounds or backreferences. I doubt you can use it to get what you want with regex. Consider using a different tool or approach. – Wiktor Stribiżew Jun 11 '18 at 09:07
  • Hi, thank you for your feedback. Do you have any other suggestions as to how I can find these duplicates? – zerina Jun 11 '18 at 09:12
  • No idea, I do not understand what you want to get in the end, nor do I know your workflow, data, etc. – Wiktor Stribiżew Jun 11 '18 at 09:13
  • I want to eliminate all duplicate words with a recipe within dataprep. – zerina Jun 11 '18 at 09:17

1 Answer


You can use the RegEx \b(\w+)\b(?=.*\b\1\b)

  • \b(\w+)\b matches any word character 1 or more times, preceded and followed by a word boundary

  • (?=.*\b\1\b) makes sure that there is a repetition of what was matched in the first group, after your match.
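
To make the behaviour of this pattern concrete, here is a minimal sketch in Python. This is only an illustration: the question is about Dataprep, which according to the comments may not support lookaheads or backreferences at all. The names `DUPLICATE_WORD`, `title`, and `words_repeated_later` are made up for the example.

```python
import re

# Pattern from the answer: a whole word that occurs again later in the text.
DUPLICATE_WORD = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")

# Sample title from the question (words are separated by "·").
title = "Men's·Tee·Shirt·Vintage·T·Shirt·1990·Deep·Black·Red·Text·Deep·Black·Red·Text·X-Small"


def words_repeated_later(text):
    """Return every word that appears again further on in the text."""
    return [match.group(1) for match in DUPLICATE_WORD.finditer(text)]


print(words_repeated_later(title))
# ['Shirt', 'Deep', 'Black', 'Red', 'Text']
```

Each word is reported on its own, and only at its first occurrence, since the lookahead requires another copy further to the right. The match is case-sensitive as written.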

Demo.

Zenoo
  • This does not really work. `\w+` does not match those weird centered dots, so the `\1` is only one word ("deep" in this case) and all the rest is in the `?=`. – tobias_k Jun 11 '18 at 08:39
  • @tobias_k Hum, no, you can clearly see every correct match on the right side. – Zenoo Jun 11 '18 at 08:40
  • But it identifies all those words as individual repeated words, with arbitrary `.*` in between, not as a single repeated sequence. Not sure if that's what OP wants. (not my downvote BTW) – tobias_k Jun 11 '18 at 08:42
  • You could add this slight variation to account for "full" repeated sequences: `\b([\w·]+)·(\1)\b` – tobias_k Jun 11 '18 at 08:43
  • I don't get what you're saying about the "single repeated sequence", can you elaborate please? – Zenoo Jun 11 '18 at 08:44
  • @tobias_k As OP stated in his question, he wants "all duplicates words in a text.". So the Regex should separate those words, at least that's what I understood. – Zenoo Jun 11 '18 at 08:47
  • Also, I am working in Dataprep. Do you have any suggestions regarding this specific tool? – zerina Jun 11 '18 at 08:57
  • @zerina If I understood you correctly, the Regex should work. Did you try it out? I have no knowledge of Dataprep in particular. – Zenoo Jun 11 '18 at 09:00
  • FYI none of the regexes you provided work. Maybe it is due to this tool. – zerina Jun 11 '18 at 09:03
  • According to a comment by Wiktor, `\1` seems not to work in DataPrep. However, for this kind of regex task, this is pretty crucial. About what I meant: Your regex checks that "deep" appears again somewhere later in the text, as does "black", "red", and "text", but it does not find "deep black red text", i.e. just the individual words, not the entire sequence of words. Also, it matches "shirt", which might not be wanted. – tobias_k Jun 11 '18 at 09:59
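
The difference between individual repeated words and a repeated sequence, as discussed in the comments, can be sketched in Python using the variation tobias_k suggested. This assumes an engine with backreference support, which Dataprep reportedly lacks, so treat it as an illustration rather than a Dataprep recipe. The names `REPEATED_SEQUENCE`, `title`, and `repeated_sequence` are hypothetical.

```python
import re

# Variation from the comments: a run of words, the "·" separator,
# then exactly the same run again.
REPEATED_SEQUENCE = re.compile(r"\b([\w·]+)·(\1)\b")

# Sample title from the question (words are separated by "·").
title = "Men's·Tee·Shirt·Vintage·T·Shirt·1990·Deep·Black·Red·Text·Deep·Black·Red·Text·X-Small"


def repeated_sequence(text):
    """Return the first run of words that is immediately repeated, or None."""
    match = REPEATED_SEQUENCE.search(text)
    return match.group(1) if match else None


print(repeated_sequence(title))
# Deep·Black·Red·Text
```

Unlike the pattern in the answer, this does not report "Shirt", because the two occurrences of "Shirt" are not adjacent. It also covers the single-word case from the question's P.S., provided the two copies are joined by the same "·" separator.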