2

I have written a UDF to replace a few specific date values in a column named "latest_travel_date" with 'NA'. The column also contains many null values, so the UDF handles those as well (see below).

Query:
def date_cleaner(date_col):
    if type(date_col) == NoneType:
        pass
    else:
        if year(date_col) in ('1899','1900'):
            date_col= 'NA'
        else:
            pass
    return date_col

date_cleaner_udf = udf(date_cleaner, DateType())

Df3= Df2.withColumn("latest_cleaned", date_cleaner_udf("latest_travel_date"))

However, I am continuously getting the error: NameError: global name 'NoneType' is not defined

Can anyone please help me to resolve this?

Preyas

3 Answers

4

This issue can be solved in two ways.

If you want to find the null values in your DataFrame, you can compare against NullType.

Like this:

if type(date_col) == NullType:

Or you can check whether date_col is None, like this:

if date_col is None:

Inside a Python UDF a null arrives as Python None, so the second form is the more direct check.

I hope this helps.

Thiago Baldim
  • I tried both of the options you suggested, but each ends in the error: AttributeError: 'NoneType' object has no attribute '_jvm' – Preyas Aug 19 '16 at 14:31
  • Could you add a sample of your DataFrame to the question? I ran the same code in my Spark setup and did not hit this issue, so we need to see the data. – Thiago Baldim Aug 19 '16 at 16:03
1

The problem is this line:

if type(date_col) == NoneType:

It looks like you actually want:

if date_col is None:
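
As a small illustration of why the original check fails, here is a Spark-free sketch. The helper name is_missing is hypothetical and not part of the question's code; the point is only that NoneType is not a built-in name, while an identity check against None needs no import.

```python
import datetime


def is_missing(value):
    """Return True only for None, which is how a null reaches a Python UDF."""
    return value is None


# A bare NoneType is not a built-in name, so referencing it raises NameError.
try:
    NoneType  # noqa: F821
    name_error_raised = False
except NameError:
    name_error_raised = True

# If an explicit type comparison is really wanted, type(None) gives the class.
none_class = type(None)
```

The identity check is usually preferred because it reads clearly and does not depend on any imported name.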
Michael Aaron Safyan
0

As pointed out by Michael, you cannot do

if type(date_col) == NoneType:

However, changing that check to use None is not enough on its own. There is another issue with

date_col= 'NA'

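It is of StringType, but you declared the return type to be DateType, so the value does not match the declared type. The _jvm error in your comment, though, most likely comes from calling year inside the UDF: assuming it is pyspark.sql.functions.year, it builds a column expression and needs the active SparkContext, which is not available where the UDF runs. Inside the UDF the value is a plain Python datetime.date, so use its year attribute instead.

It seems you just want to set date_col to None when its year is 1899 or 1900, and drop all nulls. If so, you can do this:

from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def date_cleaner(date_col):
    # Nulls arrive as None; DateType values arrive as datetime.date.
    if date_col is not None and date_col.year in (1899, 1900):
        return None

    return date_col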

date_cleaner_udf = udf(date_cleaner, DateType())

Df3= Df2.withColumn("latest_cleaned", date_cleaner_udf("latest_travel_date")).dropna(subset=["latest_travel_date"])

This works because a DateType column can hold either a valid date or null, which is why the UDF returns None rather than the string 'NA'. You can then use dropna to remove the rows with nulls and "clean" your DataFrame.
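
The cleaning rule itself can be checked without a Spark session, since a DateType value reaches a Python UDF as a datetime.date. The following is a minimal sketch under that assumption; clean_date, PLACEHOLDER_YEARS and the sample values are illustrative names and data, not taken from the question's DataFrame.

```python
import datetime

# Years the question treats as placeholder values.
PLACEHOLDER_YEARS = (1899, 1900)


def clean_date(value):
    """Return None for nulls and placeholder-year dates, else the date unchanged."""
    if value is None:
        return None
    if value.year in PLACEHOLDER_YEARS:
        return None
    return value


# Hypothetical sample values standing in for the column's contents.
samples = [datetime.date(1899, 12, 30), None, datetime.date(2016, 8, 19)]
cleaned = [clean_date(d) for d in samples]
# cleaned == [None, None, datetime.date(2016, 8, 19)]
```

A function like this could then be wrapped with udf(clean_date, DateType()) in the same way as date_cleaner above.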

shuaiyuancn