
I am using PySpark 2.4.0.

I have a date column in the dataframe as follows:

[Image: date column in the dataframe]

I need to convert it to DD-MM-YYYY format. I have tried a few solutions, including the following code, but it returns null values:

df_students_2 = df_students.withColumn(
            'new_date',
                F.to_date(
                    F.unix_timestamp('dt', '%B %d, %Y').cast('timestamp')))


Note that the dt column contains dates in several different formats. Converting would be easier if the whole column were in one format, but the dataframe is too big to go through each entry and change it by hand. For future readers, I am also including the following code, in which I tried to loop over the two date formats; it did not succeed either.

def to_date_(col, formats=(datetime.strptime(col,"%B %d, %Y"), \
                           datetime.strptime(col,"%d %B %Y"), "null")):
    
    return F.coalesce(*[F.to_date(col, f) for f in formats]) 

Any ideas?

CD_NS
  • You will find the answer here - https://stackoverflow.com/questions/38080748/convert-pyspark-string-to-date-format – Prateek Jain Jul 05 '20 at 06:03
  • I still get null values with the solution you mentioned. It's a bit complicated since the column has string values of different types. – CD_NS Jul 05 '20 at 06:40

2 Answers

Try this. It is implemented in Scala, but the same can be done in PySpark with minimal changes.

import org.apache.spark.sql.functions.{coalesce, to_date}
import spark.implicits._  // enables the $"dt" syntax; assumes a SparkSession named spark

// Example formats only; replace this list with the formats expected in the dt column
val dt_formats = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd", "MM/dd/yy", "dd-MM-yy", "dd-MM-yyyy", "yyyy/MM/dd", "dd/MM/yyyy")

val newDF = df_students.withColumn("new_date", coalesce(dt_formats.map(fmt => to_date($"dt", fmt)): _*))
Som

Try this; it should work:

from pyspark.sql.functions import to_date 
df = spark.createDataFrame([("Mar 25, 1991",), ("May 1, 2020",)],['date_str'])
df.select(to_date(df.date_str, 'MMM d, yyyy').alias('dt')).collect()

[Row(dt=datetime.date(1991, 3, 25)), Row(dt=datetime.date(2020, 5, 1))]

See also: Datetime Patterns for Formatting and Parsing

VirtualLogic