I have two data frames: df1
+---+-----------------+
|id1| items1|
+---+-----------------+
| 0| [B, C, D, E]|
| 1| [E, A, C]|
| 2| [F, A, E, B]|
| 3| [E, G, A]|
| 4| [A, C, E, B, D]|
+---+-----------------+
and df2
:
+---+-----------------+
|id2| items2|
+---+-----------------+
|001| [A, C]|
|002| [D]|
|003| [E, A, B]|
|004| [B, D, C]|
|005| [F, B]|
|006| [G, E]|
+---+-----------------+
I would like to create an indicator vector (in a new column result_array
in df1
) based on values in items2
. The vector should be of the same length as number of rows in df2
(in this example it should have 6 elements). Its elements should have either value of 1.0 if the row in items1
contains all the elements in the corresponding row of items2
, or value 0.0 otherwise. The result should look as follows:
+---+-----------------+-------------------------+
|id1| items1| result_array|
+---+-----------------+-------------------------+
| 0| [B, C, D, E]|[0.0,1.0,0.0,1.0,0.0,0.0]|
| 1| [E, A, C]|[1.0,0.0,0.0,0.0,0.0,0.0]|
| 2| [F, A, E, B]|[0.0,0.0,1.0,0.0,1.0,0.0]|
| 3| [E, G, A]|[0.0,0.0,0.0,0.0,0.0,1.0]|
| 4| [A, C, E, B, D]|[1.0,1.0,1.0,1.0,0.0,0.0]|
+---+-----------------+-------------------------+
For example, in row 0, the second value is 1.0 because [D] is a subset of [B, C, D, E] and the fourth value is 1.0 because [B, D, C] is a subset of [B, C, D, E]. All other item groups in df2
are not subsets of [B, C, D, E], thus their indicator values are 0.0.
I've tried to create a list of all item groups in items2
using collect() and then apply a udf but my data is too large (over 10 million rows).