Using udf to split a cell and return first and last index

Question

I'm using PySpark to apply a function that takes the cell value, splits it on ' ', and returns the first and last items of the split. The column contains null values, though, and I can't manage to handle the nulls before the split.

Here is my code:

def get_name(full_name):
    for i in full_name:
        if i is not None:
            name_list = full_name.split(' ')
            #first and last item of list
            return f"{name_list[0]} {name_list[-1]}"
        else:
            return full_name
    
udf_get_name = udf(lambda x: get_name(x), StringType())
df_parquet = df_parquet.withColumn("NameReduz", udf_get_name(col("FullName")))

It complains about the NoneType

This is what I'm expecting:

FullName	NameReduz
NAME SURNAME LAST	NAME LAST
NAME SURNAME1 SURNAME2 LAST	NAME LAST
null	null

you *can* achieve this using spark native functions as well. also, can you please share the error traceback? as for the null handling, you can start the function with `if full_name:` — samkart, Sep 21 '22 at 05:24

ZygD · Accepted Answer · 2022-09-21T08:26:13.187

I would suggest not using udf:

from pyspark.sql import functions as F
df_parquet = spark.createDataFrame(
    [('NAME SURNAME LAST',),
     ('NAME SURNAME1 SURNAME2 LAST',),
     (None,)],
    ['FullName'])

split_col = F.split("FullName", " ")
name_reduz = F.when(~F.isnull("FullName"), F.concat_ws(" ", split_col[0], F.element_at(split_col, -1)))
df_parquet = df_parquet.withColumn("NameReduz", name_reduz)

df_parquet.show(truncate=0)
# +---------------------------+---------+
# |FullName                   |NameReduz|
# +---------------------------+---------+
# |NAME SURNAME LAST          |NAME LAST|
# |NAME SURNAME1 SURNAME2 LAST|NAME LAST|
# |null                       |null     |
# +---------------------------+---------+

But if you want to, the following udf should work:

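def get_name(full_name):
    # Null cells reach the udf as None, so check the value itself first.
    if full_name is not None:
        name_list = full_name.split(' ')
        # First and last items of the split.
        return f"{name_list[0]} {name_list[-1]}"
    # A None input falls through and the udf returns None (null).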
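As for why the original function complains about NoneType: its `for` loop iterates over `full_name` itself, so a null cell fails before the `is not None` check is ever reached. Here is a minimal plain-Python sketch of that, with no Spark needed. The name `get_name_original` is only my label for the question's function, and the single-word value "NAME" is a made-up extra input that is not in the question's data.

```python
def get_name_original(full_name):
    # The question's version: the loop iterates over full_name itself,
    # so a None cell raises before the "is not None" check can run.
    for i in full_name:
        if i is not None:
            name_list = full_name.split(' ')
            return f"{name_list[0]} {name_list[-1]}"
        else:
            return full_name


def get_name(full_name):
    # Null-safe version: test the whole value once, no loop.
    if full_name is not None:
        name_list = full_name.split(' ')
        return f"{name_list[0]} {name_list[-1]}"
    return None


# Calling the original with None reproduces the complaint.
try:
    get_name_original(None)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Sample values from the question, plus a hypothetical single-word name.
samples = ["NAME SURNAME LAST", "NAME SURNAME1 SURNAME2 LAST", None, "NAME"]
results = [get_name(value) for value in samples]

print(error_message)
print(results)
```

Note that a single-word name comes back doubled ("NAME NAME"), because the first and last items of the split are the same element. If that matters for your data, it needs its own check. The null-safe function can be registered the same way as in the question, e.g. `udf(get_name, StringType())`.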

worked like a charm without the necessity of udf, thanks — dcrosseto, Sep 21 '22 at 12:46
