
Suppose I have a dataframe with columns A, B, C, D, E. Each of these columns can contain either a list of values or null. I would like to concatenate these values into a final column F consisting of a list of lists that ignores the null values and preserves the original column order.

Ex input: [a,b,c] | [b,c,d] | null | null | [z]
Ex output: [[a,b,c], [b,c,d], [z]]

Unfortunately, concat_ws flattens everything, so I believe I must use a UDF. Does anyone have a solution to this problem?

1 Answer

As you may know, a UDF is generally not a good idea in PySpark for performance reasons, and it seems to me this problem can be solved with built-in SQL functions.

You can start by building your array of arrays using the array function:

df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))

And then remove the null values using array_except:

df.withColumn('F', array_except('FWithNulls', array(lit(None))))

Tested with pyspark 3.1.2:

from pyspark.sql.functions import lit, array, array_except
from pyspark.sql.types import StringType, ArrayType

(
  spark.createDataFrame(
    [
      {
        'A': ['a', 'b', 'c'],
        'B': ['b', 'c', 'd'],
        'E': ['z']
      }
    ]
  )
  # C and D are typed null array columns, matching the example input
  .withColumn('C', lit(None).cast(ArrayType(StringType(), True)))
  .withColumn('D', lit(None).cast(ArrayType(StringType(), True)))
  # collect the five columns into one array of arrays (nulls included)
  .withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
  # drop the null entries
  .withColumn('F', array_except('FWithNulls', array(lit(None))))
  .show(vertical=True, truncate=False)
)
-RECORD 0---------------------------------------------
 A          | [a, b, c]                               
 B          | [b, c, d]                               
 E          | [z]                                     
 C          | null                                    
 D          | null                                    
 FWithNulls | [[a, b, c], [b, c, d], null, null, [z]] 
 F          | [[a, b, c], [b, c, d], [z]]         

As another option, one that also works for more complex predicates, you can use the filter higher-order function (available since Spark 2.4.0), as explained by @David Vrba in https://stackoverflow.com/a/57649346/18115573:

from pyspark.sql.functions import expr

(
  df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
  .withColumn('F', expr('filter(FWithNulls, x -> x is not null)'))
  .show(vertical=True, truncate=False)
)
-RECORD 0---------------------------------------------
 A          | [a, b, c]                               
 B          | [b, c, d]                               
 E          | [z]                                     
 C          | null                                    
 D          | null                                    
 FWithNulls | [[a, b, c], [b, c, d], null, null, [z]] 
 F          | [[a, b, c], [b, c, d], [z]]             
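
Since PySpark 3.1, filter is also exposed as a Python function in pyspark.sql.functions, so the same predicate can be written as a lambda instead of an expr string. A minimal sketch, assuming the same df as above and aliasing the import so it does not shadow Python's built-in filter:

from pyspark.sql.functions import array, filter as array_filter

(
  df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
  # same null check as the expr version, written as a lambda over Column
  .withColumn('F', array_filter('FWithNulls', lambda x: x.isNotNull()))
  .show(vertical=True, truncate=False)
)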

Another example with array_contains:

from pyspark.sql.functions import expr

(
  df.withColumn('FWithNulls', array('A', 'B', 'C', 'D', 'E'))
  .withColumn('F', expr('filter(FWithNulls, x -> array_contains(x, "a"))'))
  .show(vertical=True, truncate=False)
)
-RECORD 0---------------------------------------------
 A          | [a, b, c]                               
 B          | [b, c, d]                               
 E          | [z]                                     
 C          | null                                    
 D          | null                                    
 FWithNulls | [[a, b, c], [b, c, d], null, null, [z]] 
 F          | [[a, b, c]]                             

A good resource for understanding higher-order functions is the official Databricks notebook higher-order-functions-tutorial-python.

rodrigo
  • Solution does not work in pyspark 2.4. I have an error message stating `: java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.ArrayType`. Is there a workaround? – balloonpandatree Feb 04 '22 at 18:07
  • https://stackoverflow.com/questions/57627625/i-need-to-append-only-those-who-has-non-null-values-in-pyspark-dataframe The solution here worked for my version, and might as well for 3.0 and above. – balloonpandatree Feb 04 '22 at 19:04
  • Excellent! Once I'm on the computer I'll edit the answer adding the reference and application of higher-order functions for earlier versions. – rodrigo Feb 04 '22 at 21:57