
I am trying to remove duplicate words from a string in Scala.

I wrote a UDF (code below) to remove duplicate words from a string:

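import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val de_duplicate: UserDefinedFunction = udf((value: String) => {
  // Check for null first and short-circuit with ||
  if (value == null || value.isEmpty) ""
  else value.split("\\s+").distinct.mkString(" ")
})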

The problem I'm facing is that it also removes repeated single-character tokens from the string.

For example if the string was:

"test abc abc 123 foo bar f f f"

The output I'm getting is:

"test abc 123 foo bar f"
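The behaviour can be reproduced without Spark. Here is a minimal sketch, assuming a plain helper named `dedupe` (a hypothetical name, not part of the UDF above) that holds the same logic:

```scala
object DistinctDemo {
  // Same logic as the UDF body, written as a plain function (name is illustrative).
  def dedupe(value: String): String =
    if (value == null || value.isEmpty) ""
    else value.split("\\s+").distinct.mkString(" ")

  def main(args: Array[String]): Unit = {
    // distinct treats "f" like any other token, so the three copies collapse to one.
    println(dedupe("test abc abc 123 foo bar f f f")) // test abc 123 foo bar f
  }
}
```

`distinct` keeps the first occurrence of every token regardless of its length, which is why the single characters are affected too.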

What I want is to remove only repeated words and leave single characters untouched. One workaround I could think of is to remove the spaces between consecutive single-character tokens, so that the example input string would become:

"test abc abc 123 foo bar fff"  

which would solve my problem. I can't figure out the proper regex pattern, but I believe this could be done using a capture group or a lookahead. I looked at similar questions for other languages but couldn't work out the pattern in Scala.

Any help on this would be appreciated!

Vaibhav

2 Answers


If you want to remove the spaces between single-character tokens in your input string, you can use the following regex:

println("test abc abc 123 foo bar f f f".replaceAll("(?<= \\w|^\\w|^) (?=\\w |\\w$|$)", ""));

Output:

test abc abc 123 foo bar fff

Demo: https://regex101.com/r/tEKkeP/1

Explanations:

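The regex (?<= \w|^\w|^) (?=\w |\w$|$) matches a space that has a single word character on each side, where each of those characters is in turn bounded by a space or by the beginning/end-of-line anchor. The checks are done with positive lookbehind and lookahead clauses, so only the space itself is replaced. The bare ^ and $ alternatives additionally cover a leading or trailing space next to a single-character token.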
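To connect this back to the question, here is a rough sketch of how the regex could be placed in front of the `distinct` step. The object and method names (`CollapseThenDedupe`, `collapseSingles`, `dedupe`) are made up for illustration, and the Spark `udf` wrapper is left out so the snippet runs on its own:

```scala
object CollapseThenDedupe {
  // Regex from this answer, used verbatim (triple quotes avoid double escaping).
  private val SingleCharGap = """(?<= \w|^\w|^) (?=\w |\w$|$)"""

  // Glue runs of single-character tokens together: "f f f" becomes "fff".
  def collapseSingles(s: String): String = s.replaceAll(SingleCharGap, "")

  // Collapse first, then drop repeated tokens while keeping first-occurrence order.
  def dedupe(s: String): String =
    if (s == null || s.isEmpty) ""
    else collapseSingles(s).split("\\s+").distinct.mkString(" ")

  def main(args: Array[String]): Unit = {
    println(dedupe("test abc abc 123 foo bar f f f")) // test abc 123 foo bar fff
  }
}
```

Note that after collapsing, "fff" is an ordinary token, so the original spacing between the single characters is not recoverable from the result.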

More inputs:

test abc abc 123 foo bar f f f
f boo
 f boo
boo f
boo f f
too f 

Associated outputs:

test abc abc 123 foo bar fff
f boo
f boo
boo f
boo ff
too f
Allan

You can use this regex to target duplicate words of two or more characters and replace them with an empty string, so that only unique words remain:

\b(\w{2,})\b\s*(?=.*\1)

Explanation:

  • \b(\w{2,})\b - Matches a whole word of at least two characters and captures it in group 1
  • \s* - Consumes any whitespace after the word, so that no stray space is left behind when the word is removed
  • (?=.*\1) - This positive lookahead is the key: the word is matched only if the same text appears again later in the string

Regex Demo

Scala Code Demo

object Rextester extends App {
    val s = "abc test abc    abc 123 foo bar foo f sd foo f f abc"
    println("Unique words only: " + s.replaceAll("\\b(\\w{2,})\\b\\s*(?=.*\\1)",""))
}

Outputs unique words only,

Unique words only: test 123 bar f sd foo f f abc
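Two properties of this pattern are worth seeing side by side with `distinct`. The sketch below is only an illustration: the names `keepLast`, `keepLastWholeWord` and `keepFirst` are invented here, and the `\b\1\b` variant is an assumption of mine, not something taken from the regex demo.

```scala
object LookaheadDedupe {
  // Regex from this answer: removes a word when the same text appears later.
  private val KeepLast = """\b(\w{2,})\b\s*(?=.*\1)"""

  // Hypothetical variant: word boundaries around the back-reference, so that
  // "ab" is not treated as a duplicate just because "abc" follows it.
  private val KeepLastWholeWord = """\b(\w{2,})\b\s*(?=.*\b\1\b)"""

  def keepLast(s: String): String = s.replaceAll(KeepLast, "")
  def keepLastWholeWord(s: String): String = s.replaceAll(KeepLastWholeWord, "")

  // split + distinct keeps the first occurrence instead of the last.
  def keepFirst(s: String): String = s.split("\\s+").distinct.mkString(" ")

  def main(args: Array[String]): Unit = {
    println(keepLast("foo bar foo"))     // bar foo  (last "foo" survives)
    println(keepFirst("foo bar foo"))    // foo bar  (first "foo" survives)
    println(keepLast("ab abc"))          // abc      ("ab" matched inside "abc")
    println(keepLastWholeWord("ab abc")) // ab abc
  }
}
```

So the lookahead approach keeps the last occurrence of each word, which can reorder the result compared with the input, and the unanchored back-reference can match inside a longer word.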

Edit:

Since removing duplicate words is not what you wanted, and you only want to remove the spaces between single-character words, you can use this regex:

(?<=^|\b\w) +(?=\w\b|$)

and replace each match with an empty string.

Regex Demo

Scala Code,

val s = "test abc abc 123 foo bar f f f"
println("Val: " + s.replaceAll("(?<=^|\\b\\w) +(?=\\w\\b|$)",""))

Output,

Val: test abc abc 123 foo bar fff
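The ` +` in this pattern matters when more than one space separates the single characters. A small comparison sketch, assuming illustrative helper names (`joinSingles`, `joinSinglesOneSpace`) and using the single-space regex from the other answer for contrast:

```scala
object MultiSpaceGap {
  // Regex from this answer: matches a whole run of spaces between single characters.
  private val SpaceRun = """(?<=^|\b\w) +(?=\w\b|$)"""

  // Regex from the other answer: matches exactly one space at a time.
  private val SingleSpace = """(?<= \w|^\w|^) (?=\w |\w$|$)"""

  def joinSingles(s: String): String = s.replaceAll(SpaceRun, "")
  def joinSinglesOneSpace(s: String): String = s.replaceAll(SingleSpace, "")

  def main(args: Array[String]): Unit = {
    println(joinSingles("foo f  f"))         // foo ff
    println(joinSinglesOneSpace("foo f  f")) // foo f  f  (double space is left alone)
  }
}
```

With single spaces only, both patterns give the same result on the example from the question.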
Pushpesh Kumar Rajwanshi
  • Could you please explain the syntax? – Vaibhav May 20 '19 at 07:20
  • Would this regex work even if the duplicate words do not occur one after the other? For example, "foo bar foo bar" should be returned as "foo bar". – Vaibhav May 20 '19 at 07:22
  • @Vaibhav: I've added the explanation. Basically the positive lookahead will select a word which is present again in the string later somewhere ahead. – Pushpesh Kumar Rajwanshi May 20 '19 at 07:23
  • @Vaibhav: Yes, right. It targets a duplicate word wherever it appears in the string. Only the last occurrence is not selected, because the same word is not repeated after it; so the last occurrence is retained and all earlier duplicates are removed. – Pushpesh Kumar Rajwanshi May 20 '19 at 07:24
  • You can play with the string in my regex101 demo link contained in my answer. – Pushpesh Kumar Rajwanshi May 20 '19 at 07:25
  • Thanks for the answer. It does remove the duplicate instances of words, but I accepted the other one: the regex above keeps the last instance of a duplicated word, which changes the order of the words from the original string. I want to maintain that order, and the regex in the other answer does. – Vaibhav May 20 '19 at 08:28
  • @Vaibhav: I didn't know whether you wanted to preserve the order, and the other answer doesn't remove the duplicate words at all. Had I known you only wanted to remove the space between single-character words, I would have given you an even simpler regex than the other answer. You can use this regex [`(?<=^|\b\w) +(?=\w\b|$)`](https://regex101.com/r/9XZdrZ/1/), which also handles more than one space between words. – Pushpesh Kumar Rajwanshi May 20 '19 at 08:34