As you may know, using a UDF in PySpark is generally discouraged for performance reasons, and this problem can be solved with built-in SQL functions instead.
You can start by building an array of arrays with the array function:
df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
And then remove the null values using array_except:
df.withColumn('F', array_except('FWithNulls', array(lit(None))))
Tested with PySpark 3.1.2:
from pyspark.sql.functions import lit, array, array_except
from pyspark.sql.types import StringType, ArrayType
df = (
    spark.createDataFrame(
        [
            {
                'A': ['a', 'b', 'c'],
                'B': ['b', 'c', 'd'],
                'E': ['z']
            }
        ]
    )
    .withColumn('C', lit(None).cast(ArrayType(StringType(), True)))
    .withColumn('D', lit(None).cast(ArrayType(StringType(), True)))
)

(
    df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
    .withColumn('F', array_except('FWithNulls', array(lit(None))))
    .show(vertical=True, truncate=False)
)
-RECORD 0---------------------------------------------
A | [a, b, c]
B | [b, c, d]
E | [z]
C | null
D | null
FWithNulls | [[a, b, c], [b, c, d], null, null, [z]]
F | [[a, b, c], [b, c, d], [z]]
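One caveat worth noting: per the docs, array_except returns its result without duplicates, so identical inner arrays collapse into a single element. A minimal sketch of that behavior (the column name deduped is just for illustration):
from pyspark.sql.functions import array, array_except, lit

# the two identical inner arrays [a] should collapse into one element
spark.range(1).select(
    array_except(
        array(array(lit('a')), array(lit('a')), array(lit('b'))),
        array(array(lit('c')))
    ).alias('deduped')
).show(truncate=False)
# expected: [[a], [b]]
If your data can contain duplicate inner arrays that you need to keep, the filter approach below is the safer choice.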
As another option (which also works for more complex conditions), you can use the filter higher-order function (available since Spark 2.4.0), as explained by @David Vrba in https://stackoverflow.com/a/57649346/18115573:
from pyspark.sql.functions import expr

(
    df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
    .withColumn('F', expr('filter(FWithNulls, x -> x is not null)'))
    .show(vertical=True, truncate=False)
)
-RECORD 0-----------------------------------
A | [a, b, c]
B | [b, c, d]
E | [z]
C | null
D | null
FWithNulls | [[a, b, c], [b, c, d], null, null, [z]]
F | [[a, b, c], [b, c, d], [z]]
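As a side note, on PySpark 3.1+ you can write the same lambda natively in Python through pyspark.sql.functions.filter, which avoids embedding SQL in a string. A sketch of the same null filter:
from pyspark.sql.functions import array, filter as array_filter

# same null filter as above, expressed as a Python lambda instead of a SQL string
(
    df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
    .withColumn('F', array_filter('FWithNulls', lambda x: x.isNotNull()))
    .show(vertical=True, truncate=False)
)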
Another example, with array_contains (note that array_contains(null, 'a') evaluates to null, which filter treats as false, so the null entries are dropped as well):
from pyspark.sql.functions import expr

(
    df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
    .withColumn('F', expr('filter(FWithNulls, x -> array_contains(x, "a"))'))
    .show(vertical=True, truncate=False)
)
-RECORD 0-----------------------------------
A | [a, b, c]
B | [b, c, d]
E | [z]
C | null
D | null
FWithNulls | [[a, b, c], [b, c, d], null, null, [z]]
F | [[a, b, c]]
A good resource for understanding higher-order functions is the official Databricks notebook higher-order-functions-tutorial-python.
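For instance, transform (also available since Spark 2.4.0) maps a lambda over every element of an array. A quick sketch reusing the same df, where each inner array is replaced by its size (with Spark 3 defaults, size(null) should come back as null):
from pyspark.sql.functions import array, expr

# map each element of FWithNulls to its length; null entries stay null
(
    df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
    .withColumn('Sizes', expr('transform(FWithNulls, x -> size(x))'))
    .show(vertical=True, truncate=False)
)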