9

I have two array fields in a data frame.

enter image description here

I have a requirement to compare these two arrays and get the difference as an array(new column) in the same data frame.

Expected output is:

enter image description here

Column B is a subset of column A. Also the words is going to be in the same order in both arrays.

Can any one please help me to get a solution for this?

ZygD
  • 22,092
  • 39
  • 79
  • 102
jiks-hue
  • 139
  • 1
  • 1
  • 7

2 Answers2

13

Since Spark 2.4.0, this can be solved easily using array_except. Taking the example

from pyspark.sql import functions as F

#example df
df=sqlContext.createDataFrame(pd.DataFrame(data=[[["hello", "world"], 
["world"]],[["sample", "overflow", "text"], ["sample", "text"]]], columns=["A", "B"]))


df=df.withColumn('difference', F.array_except('A', 'B'))

for more similar operations on arrays, I suggest this blogpost https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read

Benoit Descamps
  • 464
  • 5
  • 7
7

You can use a user-defined function. My example dataframe differs a bit from yours, but the code should work fine:

import pandas as pd
from pyspark.sql.types import *

#example df
df=sqlContext.createDataFrame(pd.DataFrame(data=[[["hello", "world"], 
["world"]],[["sample", "overflow", "text"], ["sample", "text"]]], columns=["A", "B"]))

# define udf
differencer=udf(lambda x,y: list(set(x)-set(y)), ArrayType(StringType()))
df=df.withColumn('difference', differencer('A', 'B'))

EDIT:

This does not work if there are duplicates as set retains only uniques. So you can amend the udf as follows:

differencer=udf(lambda x,y: [elt for elt in x if elt not in y] ), ArrayType(StringType()))
ags29
  • 2,621
  • 1
  • 8
  • 14
  • Thanks a ton ags29..It did work like a charm :-) only one doubt, it removes the duplicates too..Is there any way to keep all the values? – jiks-hue Oct 27 '17 at 15:10
  • 1
    Thanks again...It's retaining the duplicates now.. :-) I had to add a list in the command which I think you forgot to add...differencer=udf(lambda x,y: list([elt for elt in x if elt not in y] ), ArrayType(StringType())) – jiks-hue Oct 30 '17 at 10:39
  • @ags29 - Just a syntax change in the Edit section, in the `difference`. It should be `differencer=udf(lambda x,y: [elt for elt in x if elt not in y], ArrayType(StringType()))` – malhar Aug 08 '18 at 16:07