I am trying to remove duplicate words from a string in Scala.
I wrote a UDF (code below) to do this:
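import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
// Splits on whitespace and keeps only the first occurrence of each token.
val de_duplicate: UserDefinedFunction = udf((value: String) => {
  if (value == null || value == "") ""
  else value.split("\\s+").distinct.mkString(" ")
})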
The problem is that it also collapses repeated single-character tokens in the string.
For example, if the string is:
"test abc abc 123 foo bar f f f"
The output I'm getting is:
"test abc 123 foo bar f"
What I want is to remove only repeated words, not repeated single characters. One workaround I could think of is to remove the spaces between consecutive single-character tokens, so that the example input string would become:
"test abc abc 123 foo bar fff"
which would solve my problem. I can't figure out the proper regex pattern, but I believe this could be done using a capture group or a look-ahead. I looked at similar questions for other languages but couldn't work out the regex pattern in Scala.
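To make the workaround concrete, here is a rough sketch of what I have in mind, written as plain Scala so it can be tried without Spark. The object name MergeSingleCharTokens and the helpers mergeSingles and deDuplicate are names made up for illustration, and the pattern assumes tokens are separated by whitespace only. It uses a look-behind and a look-ahead to match a run of whitespace that sits between two one-character tokens:

```scala
object MergeSingleCharTokens {
  // Hypothetical pattern: a whitespace run whose left neighbour is a
  // one-character token and whose right neighbour is a one-character token.
  //   (?<=(?<!\S)\S)  exactly one non-space char before, with no non-space char before it
  //   (?=\S(?!\S))    exactly one non-space char after, with no non-space char after it
  private val gapBetweenSingles = """(?<=(?<!\S)\S)\s+(?=\S(?!\S))"""

  // Removes the whitespace between consecutive single-character tokens.
  def mergeSingles(s: String): String = s.replaceAll(gapBetweenSingles, "")

  // Same de-duplication as the UDF body, applied after merging the singles.
  def deDuplicate(value: String): String =
    if (value == null || value.isEmpty) ""
    else mergeSingles(value).trim.split("\\s+").distinct.mkString(" ")
}
```

With the example input, mergeSingles should give "test abc abc 123 foo bar fff" and deDuplicate should give "test abc 123 foo bar fff". Note that this also glues together different single characters (for example "a b c" becomes "abc"), which is what the workaround describes but may not be what is wanted in every case.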
Any help on this would be appreciated!