I am able to drop duplicates successfully using the Spark DataFrame method dropDuplicates, which treats only a 100% match in exact order as a duplicate. For example, if we have two "red toys" entries, one of them is considered a duplicate and gets filtered out.
The new requirement says that we also need to treat the same words in reverse order as duplicates. Referring to the example above, if we have "red toys" and "toys red", they are considered duplicates and the duplicate should be removed. This requirement applies only to two-word phrases.
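To make the requirement concrete, here is a rough sketch in PySpark of one idea I have been considering: build an order-insensitive key for two-word phrases and call dropDuplicates on that key. The names canonical_key and drop_reversed_duplicates, the column name phrase, and the temporary _key column are all hypothetical, and I am assuming words are separated by whitespace and that matching stays case-sensitive. I am not sure this is the right or most efficient approach.

```python
def canonical_key(phrase):
    """Return an order-insensitive key for two-word phrases (hypothetical helper).

    Phrases with exactly two whitespace-separated words get their words
    sorted, so "toys red" and "red toys" map to the same key. Anything
    else (including None) is returned unchanged.
    """
    if phrase is None:
        return None
    words = phrase.split()
    if len(words) == 2:
        return " ".join(sorted(words))
    return phrase


def drop_reversed_duplicates(df, column="phrase"):
    """Drop rows whose value in `column` duplicates another row's value,
    treating reversed two-word phrases as equal.

    Assumes `df` is a Spark DataFrame with a string column named `column`.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    key_udf = F.udf(canonical_key, StringType())
    return (
        df.withColumn("_key", key_udf(F.col(column)))  # temporary key column
          .dropDuplicates(["_key"])                    # dedupe on the key only
          .drop("_key")
    )
```

With this, "red toys" and "toys red" would share the key "red toys", so only one of the two rows would survive; which one is kept is not something dropDuplicates guarantees.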
Can someone please suggest an approach for this in Spark? Also, I am wondering whether this is a use case for Machine Learning or NLP.