I am able to drop duplicates successfully using the Spark DataFrame method dropDuplicates, which treats only rows that match exactly, with the same words in the same order, as duplicates. For example, if we have two rows containing "red toys", one of them is considered a duplicate and is filtered out.

Now a new requirement says that the same words in reverse order must also be treated as duplicates. Referring to the example above, if we have "red toys" and "toys red", they are considered duplicates and one of them should be removed. This requirement applies only to two-word phrases.

Can someone please suggest an approach for this in Spark? Also, I am wondering whether this is a use case for machine learning or NLP.

Anand

1 Answer

The most straightforward solution would be to split the sentence into an array of words, sort the array, and then drop duplicates based on this new column.

In Spark 2.4.0+ this can be done using array_sort and split as follows:

import org.apache.spark.sql.functions.{array_sort, split}

df.withColumn("arr", array_sort(split($"words", " ")))
  .dropDuplicates("arr")

The new arr column can be dropped afterwards with .drop("arr") if it is no longer needed.
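
For reference, the snippet above can be run end to end roughly as in the following minimal sketch. The local SparkSession, the sample rows, and the names DedupExample and dedupIgnoringWordOrder are assumptions made for illustration; only the column name words and the calls to split, array_sort and dropDuplicates come from the snippet.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_sort, col, split}

object DedupExample {
  // Hypothetical helper: treats phrases with the same words in any order as duplicates.
  def dedupIgnoringWordOrder(df: DataFrame): DataFrame =
    df.withColumn("arr", array_sort(split(col("words"), " "))) // order-independent key
      .dropDuplicates("arr")                                    // keep one row per key
      .drop("arr")                                              // remove the helper column

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("dedup-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    // Hypothetical sample data: "red toys" and "toys red" contain the same words.
    val df = Seq("red toys", "toys red", "blue car").toDF("words")

    // Two rows remain; which of the two "red toys" variants is kept is not guaranteed.
    dedupIgnoringWordOrder(df).show()

    spark.stop()
  }
}
```

Note that this version sorts phrases of any length, so "big red toys" and "toys red big" would also be collapsed into one row.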


With an older Spark version, or if more complex logic is needed (e.g. reversed order should only count as a duplicate for two-word phrases), a UDF can be used. For example, to apply the rule only to two-word phrases:

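import org.apache.spark.sql.functions.{split, udf}

// Sort the words only for two-word phrases; leave all other phrases unchanged.
val sort_udf = udf((arr: Seq[String]) => {
  if (arr.size == 2) arr.sorted else arr
})

df.withColumn("arr", sort_udf(split($"words", " ")))
  .dropDuplicates("arr")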
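
As an alternative to the UDF, the two-word rule can probably also be expressed with built-in column functions, which keeps the logic visible to Spark's optimizer. This is a sketch of a swapped-in technique, not the approach from the answer itself: it combines when, size and sort_array from org.apache.spark.sql.functions, and the names PhraseDedup and dropReversedTwoWordDuplicates are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, size, sort_array, split, when}

object PhraseDedup {
  // Hypothetical helper: removes rows whose two-word phrase duplicates another row
  // in either word order. Phrases of any other length must match exactly.
  def dropReversedTwoWordDuplicates(df: DataFrame, column: String = "words"): DataFrame = {
    val words = split(col(column), " ")
    // Sort only two-word phrases; keep the original word order for everything else.
    val key = when(size(words) === 2, sort_array(words)).otherwise(words)
    df.withColumn("arr", key)
      .dropDuplicates("arr")
      .drop("arr")
  }
}
```

Calling PhraseDedup.dropReversedTwoWordDuplicates(df) on a DataFrame with a words column should keep one of "red toys" and "toys red", while "big red toys" and "toys red big" both survive because they are not two-word phrases.
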
Shaido