For your given dataframe
, you can simply do the following
from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
.withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
You should have output dataframe
as
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| null| null|
|1112| [45, 46]| [50, 50]|
|1113| null| null|
+----+---------+-----------+
Updated
In case you have more than two array columns and you want to apply the above logic dynamically, you can use the following logic
from pyspark.sql import functions as F
for c in df.dtypes:
if "array" in c[1]:
df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()
Here,
df.dtypes
would give you array of tuples with column name and datatype. As for the dataframe in the question it would be
[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
withColumn
is applied to only array columns ("array" in c[1])
where F.size(F.col(c[0])) == 0
is the condition checking for when
function which checks for the size of the array. if the condition is true i.e. empty array then None is populated else original value is populated. The loop is applied to all the array columns.