Given the following example:
d = [{'asset': '2', 'ts': 6, 'B':'123','C':'234'},
{'asset': '1', 'ts': 5, 'C.1':'999', 'B':'888','F':'999'}]
df = spark.createDataFrame(d)
df.show(truncate=False)
+---+----+-----+---+----+----+
|B |C |asset|ts |C.1 |F |
+---+----+-----+---+----+----+
|123|234 |2 |6 |null|null|
|888|null|1 |5 |999 |999 |
+---+----+-----+---+----+----+
I want to create the following output:
+-----+---+--------------------------------+
|asset|ts |signals |
+-----+---+--------------------------------+
|2 |6 |[B -> 123, C -> 234] |
|1 |5 |[B -> 888, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------+
I tried the following:
from itertools import chain
from pyspark.sql.functions import *
all_signals=['B','C','C.1','F']
key_values = create_map(*(chain(*[(lit(name), col("`"+name+"`"))
for name in all_signals])))
new_df = df.withColumn('signals',key_values).drop(*all_signals).show(truncate=False)
+-----+---+--------------------------------------+
|asset|ts |signals |
+-----+---+--------------------------------------+
|2 |6 |[B -> 123, C -> 234, C.1 ->, F ->] |
|1 |5 |[B -> 888, C ->, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------------+
But I don't want the keys with null values. So I tried so many ways to exclude null or None. I tried "if" conditions, when/otherwise but none seem to work. Here is one attempt:
key_values = create_map(*(chain(*[(lit(name), col("`"+name+"`"))
for name in all_signals
if col("`"+name+"`").isNotNull()])))
new_df = df.withColumn('signals',key_values).drop(*all_signals).show(truncate=False)
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I got it to work using a circular way which I am not happy with:
new_df= df.withColumn("signals", from_json(
to_json(struct(["`"+x+"`" for x in all_signals])),"MAP<STRING,STRING>"))
new_df = new_df.drop(*all_signals)
new_df.show(truncate=False)
+-----+---+--------------------------------+
|asset|ts |signals |
+-----+---+--------------------------------+
|2 |6 |[B -> 123, C -> 234] |
|1 |5 |[B -> 888, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------+
But there must be a way to exclude null without going to json and back!