0

I want to convert a date into a specific 'yy-mm-dd' format. The input can be in any format, for example, it can be twenty-sixth May twenty twenty or 26-05-2020 or 26/05/2020 or May 26,2020 or May twenty-sixth etc. The output for all the above scenarios should be 2020-05-26

desertnaut
  • 57,590
  • 26
  • 140
  • 166
  • 1
    You will have to handle every possible input format. I doubt there's any such library that can reliably do this since a single input format can evaluate to multiple outputs. For e.g. 19-03-02 can be evaluated as (2019,Mar,02), (2019,Feb,03), (2002,Mar,19), etc. – Shiva May 26 '20 at 12:28

1 Answers1

-1

You will have to use regular expressions. I wrote a function below which does some of what you asked.

It covers:

  • dd-mm-yyyy
  • dd/mm/yyyy
  • mm-dd-yyyy
  • dd/mm/yyyy

It does not cover May 26,2020 or May twenty-sixth (hoping someone can help with that, because I did not have enough time) But I hope it's a reasonable start at least. You can build on that if you know how to use regex.

I dont know what your input format is. I have assumed it is a DataFrame where the date column has a consistent format. Otherwise this exercise is not possible because you could have something like "02-02-2020" which could mean "dd-mm-yyyy" or "mm-dd-yyyy".

This function checks the entire column, get the "max" date (which hopefully contains a day which is over 12) and identify the day and month columns. Then depending on what format the column is, it accordingly reformats it to "yyyy-mm-dd".

import re

def clean_date_format(date_col):

        # replace "-" with "/"
        new_date = (date_col.str.replace('-', '/')).tolist()

        # check if US format
        first2_regex = r'^(\d*)\/.*'
        second2_regex = r'^.*\/(\d*)\/.*'

        r1 = re.compile(first2_regex)
        r2 = re.compile(second2_regex)

        first2_list = set([re.search(r1, x).group(1) for x in new_date])
        second2_list = set([re.search(r2, x).group(1) for x in new_date])

        first2_condition =  max(list(map(int,first2_list))) <= 12 # i.e. first column contains month
        second2_condition =  max(list(map(int,second2_list))) > 12 # i.e. second column contains day

        US_FORMAT = first2_condition & second2_condition


        if US_FORMAT:
            new_date = pd.DataFrame([datetime.strptime(d, "%m/%d/%Y").strftime("%Y-%m-%d") for d in new_date])
        else:
            new_date = pd.DataFrame([datetime.strptime(d, "%d/%m/%Y").strftime("%Y-%m-%d") for d in new_date])
        return new_date
Rabi
  • 11
  • 2