
I am trying to aggregate a table around one key value (id here) so that I have one row per id, and then perform some verifications on the rows that belong to each id in order to identify the 'result' (a type of transaction, of sorts). Let's say that after aggregating, I have something like this:

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F, types as T

sc = SparkContext()
sqlContext = SQLContext(sc)  # needed in 1.6 so rdd.toDF() is available

cols = ['id', 'list1', 'list2']
data = [('zero', ['cd1', 'cd7', 'cd5', 'cd2'], ['', '', '', 'debit']),
        ('one', ['cd2', 'cd3', 'cd9', 'cd6'], ['credit', '', '', '']),
        ('two', ['cd4', 'cd3', 'cd5', 'cd1'], ['', '', '', ''])]

rdd = sc.parallelize(data)
df = rdd.toDF(cols)


>>> df.show()
+----+--------------------+--------------+
|  id|               list1|         list2|
+----+--------------------+--------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]|
| two|[cd4, cd3, cd5, cd1]|      [, , , ]|
+----+--------------------+--------------+

The question I have to answer here is: does list1 contain cd9? If so, what is the value in list2 at the position where list1 holds cd2? For example, in row 'one', list1 contains cd9 and cd2 sits at index 0, so the answer is list2[0] = 'credit'.

To solve it, I defined a couple of UDFs, since the array functions in PySpark 1.6 are limited:


# enum: indexes in list x where the element equals y
enum = F.udf(lambda x, y: [i for i, e in enumerate(x) if e == y], T.ArrayType(T.IntegerType()))
# elat: elements of list x at the indexes listed in y
elat = F.udf(lambda x, y: [e for i, e in enumerate(x) if i in y], T.ArrayType(T.StringType()))
# nulls: empty list, used as the fallback for the lookup column
nulls = F.udf(lambda: [], T.ArrayType(T.IntegerType()))
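
To illustrate what these compute, here is the same logic in plain Python on the data from row 'zero' (no Spark needed; just a sketch of the lambdas above):

list1 = ['cd1', 'cd7', 'cd5', 'cd2']
list2 = ['', '', '', 'debit']
indexes = [i for i, e in enumerate(list1) if e == 'cd2']   # what enum returns: [3]
values = [e for i, e in enumerate(list2) if i in indexes]  # what elat returns: ['debit']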

Then I create a new 'lookup' column holding the indexes of the elements I want to grab from the other column of lists:


df = df.withColumn('lookup',
    F.when(F.array_contains(F.col('list1'), 'cd7')
           | F.array_contains(F.col('list1'), 'cd9'),
           enum(F.col('list1'), F.lit('cd2')))
    .otherwise(nulls()))

And finally, I use this column to reach my end goal:


df = df.withColumn('result',
    F.when(F.array_contains(F.col('list1'), 'cd7')
           & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 7 - DEBIT')
    .when(F.array_contains(F.col('list1'), 'cd7')
          & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 7 - CREDIT')
    .when(F.array_contains(F.col('list1'), 'cd9')
          & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'debit'), 'CD 9 - DEBIT')
    .when(F.array_contains(F.col('list1'), 'cd9')
          & F.array_contains(elat(F.col('list2'), F.col('lookup')), 'credit'), 'CD 9 - CREDIT')
    .otherwise('etc'))

>>> df.show()
+----+--------------------+--------------+------+-------------+
|  id|               list1|         list2|lookup|       result|
+----+--------------------+--------------+------+-------------+
|zero|[cd1, cd7, cd5, cd2]| [, , , debit]|   [3]| CD 7 - DEBIT|
| one|[cd2, cd3, cd9, cd6]|[credit, , , ]|   [0]|CD 9 - CREDIT|
| two|[cd4, cd3, cd5, cd1]|      [, , , ]|    []|          etc|
+----+--------------------+--------------+------+-------------+

But I would much prefer to achieve the same thing without creating an extra column, because the actual dataframe has many more columns and the lookup list may need to access different columns depending on the rule I need to check for. When I tried to combine the elat and enum UDFs in one go, Spark was unable to compute one or the other.
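
Roughly, the combined shape I am after looks like this (a sketch, not my exact code; it feeds enum's output straight into elat, with no intermediate column):

df = df.withColumn('result',
    F.when(F.array_contains(F.col('list1'), 'cd7')
           & F.array_contains(
                 elat(F.col('list2'), enum(F.col('list1'), F.lit('cd2'))),  # nested UDFs, no lookup column
                 'debit'),
           'CD 7 - DEBIT')
    .otherwise('etc'))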
