I am trying to aggregate a table around one key value (id here) so that I end up with one row per id, and then perform some verifications on the rows that belong to each id in order to identify the 'result' (a type of transaction, of sorts). Let's say that after aggregating, I have something like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F, types as T

sc = SparkContext()
sqlContext = SQLContext(sc)  # needed in 1.6 so that rdd.toDF() is available
cols = ['id', 'list1', 'list2']
data = [('zero', ['cd1', 'cd7', 'cd5', 'cd2'], ['', '', '', 'debit']), ('one', ['cd2', 'cd3', 'cd9', 'cd6'], ['credit', '', '', '']), ('two', ['cd4', 'cd3', 'cd5', 'cd1'], ['', '', '', ''])]
rdd = sc.parallelize(data)
df = rdd.toDF(cols)
>>> df.show()
+----+--------------------+--------------+
| id| list1| list2|
+----+--------------------+--------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]|
| two|[cd4, cd3, cd5, cd1]| [, , , ]|
+----+--------------------+--------------+
The question I have to answer here is: does list1 contain cd9? If so, what is the value in list2 at the position where cd2 sits in list1?
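To make that concrete, here is the same rule written in plain Python against the 'one' row above:

list1 = ['cd2', 'cd3', 'cd9', 'cd6']
list2 = ['credit', '', '', '']

if 'cd9' in list1:
    idx = list1.index('cd2')  # position of cd2 in list1
    print(list2[idx])         # prints 'credit'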
What I did to solve it was define a couple of UDFs, since the built-in array functions in PySpark 1.6 are limited:
# enum: all indexes in x where the element equals y
enum = F.udf(lambda x, y: [i for i, e in enumerate(x) if e == y], T.ArrayType(T.IntegerType()))
# elat: the elements of x sitting at the indexes listed in y
elat = F.udf(lambda x, y: [e for i, e in enumerate(x) if i in y], T.ArrayType(T.StringType()))
# nulls: zero-argument UDF that returns an empty integer array
nulls = F.udf(lambda: [], T.ArrayType(T.IntegerType()))
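As a quick sanity check, the underlying lambdas behave as expected on plain lists (outside Spark), using the 'zero' row as input:

f_enum = lambda x, y: [i for i, e in enumerate(x) if e == y]
f_elat = lambda x, y: [e for i, e in enumerate(x) if i in y]

print(f_enum(['cd1', 'cd7', 'cd5', 'cd2'], 'cd2'))  # [3]
print(f_elat(['', '', '', 'debit'], [3]))           # ['debit']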
Then I create a new 'lookup' column holding the indexes of the elements I want to grab from the other column of lists:
df = df.withColumn('lookup',
    F.when(F.array_contains(F.col('list1'), 'cd7') | F.array_contains(F.col('list1'), 'cd9'),
           enum(F.col('list1'), F.lit('cd2')))
     .otherwise(nulls()))
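With the example data, the intermediate column can be inspected on its own, and it matches the lookup values visible in the final output below:

# should show [3] for 'zero', [0] for 'one' and [] for 'two'
df.select('id', 'lookup').show()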
And finally I use this column to reach my end goal:
df = df.withColumn('result',
    F.when(F.array_contains(F.col('list1'), 'cd7') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 7 - DEBIT')
     .when(F.array_contains(F.col('list1'), 'cd7') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 7 - CREDIT')
     .when(F.array_contains(F.col('list1'), 'cd9') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 9 - DEBIT')
     .when(F.array_contains(F.col('list1'), 'cd9') & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 9 - CREDIT')
     .otherwise('etc'))
>>> df.show()
+----+--------------------+--------------+------+-------------+
| id| list1| list2|lookup| result|
+----+--------------------+--------------+------+-------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]| [3]| CD 7 - DEBIT|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]| [0]|CD 9 - CREDIT|
| two|[cd4, cd3, cd5, cd1]| [, , , ]| []| etc|
+----+--------------------+--------------+------+-------------+
But I would very much prefer a way to achieve the same result without creating the extra column, because the actual dataframe has more columns and the lookup list may need to access different columns depending on the rule I need to check for. When I tried to combine the elat and enum UDFs in one go, Spark was unable to compute one or the other.
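For reference, the combined attempt looked roughly like this (a sketch of one rule only; the exact expression that failed may have differed), nesting enum directly inside elat instead of materializing 'lookup':

# hypothetical one-shot version: feed enum's output straight into elat,
# avoiding the intermediate 'lookup' column; this is the kind of nesting
# that failed to compute for me
df = df.withColumn('result',
    F.when(F.array_contains(F.col('list1'), 'cd7')
           & F.array_contains(elat(F.col('list2'), enum(F.col('list1'), F.lit('cd2'))), 'debit'),
           'CD 7 - DEBIT')
     .otherwise('etc'))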