
Using Spark 3.1, I am trying to convert string values in "MM/dd/yyyy" format into the "dd-MM-yyyy" date format. There are 5 date columns in total in my file that I want to change from "MM/dd/yyyy" to "dd-MM-yyyy". Some values in these columns are already in the target format, like 05-02-2022 ("dd-MM-yyyy"), while others are like 10/23/2021 ("MM/dd/yyyy"). I want to convert only the values that are in "MM/dd/yyyy" format to "dd-MM-yyyy" format. How can I achieve this?

Input:

df = pd.DataFrame([['10/23/2019', '09/13/2021'], ['06/16/2020', '03/16/2021'], ['09/06/2022', '12/23/2019']], columns=['A', 'B'])

The output should look like 23-10-2019, 13-09-2021, etc.

My code:

df = df.withColumn('date_col', to_date('Date_col', 'dd-MM-yy'))

The code runs fine, but it returns null in the output for the date column. Since I have 5 date columns, is it possible to do this with a for loop?

  • Hi @DerekO, suppose I have one date column in my dataframe with a value like 10/21/2022. When I open it in Excel it is shown as General, not as Date. I want to change this value into dd-MM-yyyy format (21-10-2022). – DataWorld Oct 09 '22 at 19:24
  • Can you include a sample of your dataframe as formatted text? There are some good examples of how to best do that [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Derek O Oct 09 '22 at 19:32
  • Note - Spark's default date format is `yyyy-MM-dd`, which is of DateType(). All other formats are strings for Spark. Looking at the sample, your source format seems to be `MM/dd/yyyy`. `to_date` will parse the string based on the provided source format and return the default formatted DateType() result. – samkart Oct 11 '22 at 07:02

3 Answers


to_date converts a column to date type, so the format you pass it describes the format your string is currently in, not the target format. To get a string in another format, use the date_format function, which converts the date column back to a string in the format you specify.

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

df.select(date_format(to_date(df.t, 'MM/dd/yyyy'), "dd-MM-yyyy").alias('date')).show()
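
A quick way to see what each function returns (a minimal sketch, assuming the same `df.t` string column as in the snippet above):

from pyspark.sql.functions import to_date, date_format

# to_date alone yields DateType, which Spark always displays as yyyy-MM-dd
df.select(to_date(df.t, 'MM/dd/yyyy').alias('date')).printSchema()
# root
#  |-- date: date (nullable = true)

# wrapping it in date_format yields a plain string in the requested format
df.select(date_format(to_date(df.t, 'MM/dd/yyyy'), 'dd-MM-yyyy').alias('date')).printSchema()
# root
#  |-- date: string (nullable = true)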

If you only have 5 columns to convert and that number will not change dynamically, I suggest you just do:

df.withColumn("col1", date_format(to_date("dt1", MM/dd/yyyy'), "dd-MM-yyyy"))
.withColumn("col2", date_format(to_date("dt2", MM/dd/yyyy'), "dd-MM-yyyy"))
.withColumn("col3", date_format(to_date("dt3", MM/dd/yyyy'), "dd-MM-yyyy"))
.withColumn("col4", date_format(to_date("dt4", MM/dd/yyyy'), "dd-MM-yyyy"))
.withColumn("col5", date_format(to_date("dt5", MM/dd/yyyy'), "dd-MM-yyyy"))

Here are my 2 cents

  1. Create a data frame with 2 columns (you can create as many columns as you need):

     df = spark.createDataFrame([('05/22/2022','10/21/2022')],schema=['Date_col1','Date_col2'])
    
  2. Use to_date() followed by date_format():

     from pyspark.sql.functions import *
    
     df1 = df.withColumn('date_col1', date_format(to_date(col("Date_col1"), "MM/dd/yyyy"), 'dd-MM-yyyy'))\
             .withColumn('date_col2', date_format(to_date(col("Date_col2"), "MM/dd/yyyy"), 'dd-MM-yyyy'))
  3. Print the data frame

     df1.show()
    

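The screenshot from the original answer is not available; based on the sample input above, df1.show() should print something like:

# +----------+----------+
# | date_col1| date_col2|
# +----------+----------+
# |22-05-2022|21-10-2022|
# +----------+----------+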


Example input df (all the date columns are of string type):

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('x', '01/31/2021', '12/31/2020'),
     ('y', '1/1/2020', '1/18/2020'),
     ('z', '15-01-2020', '20-10-2020')],
    ['id', 'c1', 'c2'])

The following loop converts each of these columns into date type:

for c in ['c1', 'c2']:
    df = df.withColumn(
        c, 
        F.when(F.col(c).rlike(r'\d{1,2}/\d{1,2}/\d{4}'), F.to_date(c, 'M/d/yyyy'))
         .when(F.col(c).rlike(r'\d{1,2}-\d{1,2}-\d{4}'), F.to_date(c, 'd-M-yyyy'))
    )

df.show()
# +---+----------+----------+
# | id|        c1|        c2|
# +---+----------+----------+
# |  x|2021-01-31|2020-12-31|
# |  y|2020-01-01|2020-01-18|
# |  z|2020-01-15|2020-10-20|
# +---+----------+----------+

The `rlike` method checks whether the string matches the given regex pattern; if it does, the `when` condition applies the corresponding `to_date` parse.
`\d{1,2}/\d{1,2}/\d{4}` matches the format `M/d/yyyy`.
`\d{1,2}-\d{1,2}-\d{4}` matches the format `d-M-yyyy`.
If a value is in some other format, you will get null. To avoid the null, you would need to add a branch for that format too, as sketched below.
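
For example, a minimal sketch of handling one additional, hypothetical format (say `yyyy/M/d`) by chaining another `when`:

from pyspark.sql import functions as F

for c in ['c1', 'c2']:
    df = df.withColumn(
        c,
        F.when(F.col(c).rlike(r'\d{1,2}/\d{1,2}/\d{4}'), F.to_date(c, 'M/d/yyyy'))
         .when(F.col(c).rlike(r'\d{1,2}-\d{1,2}-\d{4}'), F.to_date(c, 'd-M-yyyy'))
         # hypothetical extra branch, e.g. for values like '2020/01/15'
         .when(F.col(c).rlike(r'\d{4}/\d{1,2}/\d{1,2}'), F.to_date(c, 'yyyy/M/d'))
    )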

Spark's date type has no display format of its own; dates are always shown as yyyy-MM-dd. If you need another display format, you will need to convert the date type back into string type, but with the format that you need:

for c in ['c1', 'c2']:
    df = df.withColumn(
        c, 
        F.date_format(
            F.when(F.col(c).rlike(r'\d{1,2}/\d{1,2}/\d{4}'), F.to_date(c, 'M/d/yyyy'))
             .when(F.col(c).rlike(r'\d{1,2}-\d{1,2}-\d{4}'), F.to_date(c, 'd-M-yyyy')),
            "dd-MM-yyyy"
        )
    )

df.show()
# +---+----------+----------+
# | id|        c1|        c2|
# +---+----------+----------+
# |  x|31-01-2021|31-12-2020|
# |  y|01-01-2020|18-01-2020|
# |  z|15-01-2020|20-10-2020|
# +---+----------+----------+