3

I started using pandas library about a fortnight back. Learning the new features. I would appreciate help on the following problem.

I have a column with dates in mixed format. These are the 2 formats present

  1. mm/dd/yyyy
  2. dd/mm/yyyy

An extract from the dataset :-

Dates  
6/5/2016  
7/5/2016  
7/5/2016  
7/5/2016  
9/5/2016  
9/5/2016  
9/5/2016  
9/5/2016  
5/13/2016  
5/14/2016  
5/14/2016  

I am struggling to convert these to a common format. I tried using pandas's 'to_datetime'. It does not work. I am also not sure how will using regular expressions help in this case.

Another piece of information. The dates are in sorted order. Can something be done using info.

EDIT1:

I understand that it is impossible to distinguish between 6/4/2016 and 5/6/2016, if we look at them alone. However, I was hoping that the fact the dates are in ascending order and the actual dataset is spread for over a year, there would be a method to make sense of it. Is anyone aware of a function which can make sense of the format given the fact that the dates are in ascending order ?

EDIT2: Sample of 2 months :- April and May 2016. Please note that there is no pattern. So please do not suggest any solution based on patterns of the data below.

4/1/2016 4/1/2016 4/3/2016 4/3/2016 3/4/2016 4/4/2016 4/4/2016 4/5/2016 4/5/2016 4/7/2016 4/7/2016 4/8/2016 4/8/2016 4/14/2016 4/16/2016 6/4/2016 7/4/2016 8/4/2016 11/4/2016 11/4/2016 11/4/2016 11/4/2016 11/4/2016 12/4/2016 12/4/2016 12/4/2016 13/4/2016 13/4/2016 13/4/2016 13/4/2016 14/04/2016 15/4/2016 16/4/2016 16/4/2016 18/4/2016 18/4/2016 19/4/2016 19/4/2016 20/4/2016 20/4/2016 21/4/2016 21/4/2016 21/4/2016 22/4/2016 23/4/2016 23/4/2016 25/4/2016 25/4/2016 26/4/2016 26/4/2016 26/4/2016 26/4/2016 26/4/2016 26/4/2016 29/4/2016 29/4/2016 29/4/2016 30/4/2016 2/5/2016 2/5/2016 3/5/2016 3/5/2016 3/5/2016 3/5/2016 4/5/2016 5/4/2016 5/4/2016 5/4/2016 6/5/2016 6/5/2016 7/5/2016 7/5/2016 7/5/2016 9/5/2016 9/5/2016 9/5/2016 9/5/2016 10/5/2016 10/5/2016 11/5/2016 11/5/2016 12/5/2016 5/13/2016 5/14/2016 5/14/2016 5/15/2016 5/16/2016 5/16/2016 5/16/2016 5/16/2016 5/16/2016 5/16/2016 5/16/2016 5/17/2016 5/17/2016 5/18/2016 5/18/2016 5/19/2016 5/19/2016 5/20/2016 5/20/2016 5/20/2016 5/20/2016 5/20/2016 5/21/2016 5/23/2016 5/23/2016 5/23/2016 5/23/2016 5/23/2016 5/23/2016 5/24/2016 5/24/2016 5/25/2016 5/26/2016 5/26/2016 5/26/2016 5/27/2016 5/27/2016 5/27/2016 5/27/2016 5/27/2016 5/27/2016 5/27/2016 5/28/2016 5/30/2016 5/30/2016

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Amit
  • 33
  • 1
  • 8
  • what is a fortnight? – zetavolt May 31 '16 at 06:13
  • http://stackoverflow.com/questions/29516616/formatting-inconsistent-date-data-with-pandas - I have checked this post..However this does not work for my data-set even using error='coerce'. – Amit May 31 '16 at 06:15
  • Are you sure it is inconsistent? If yes, I think it is impossible distinguish in some dates e.g. between `6/5/2016` and `5/6/2016` . First is May or June? – jezrael May 31 '16 at 06:17
  • @jezrael : Firstly, thank you for your kind reply ! yes, the dates column is inconsistent. This data is maintained on a spreadsheet. People have not followed a consistent date-format. I understand that it is impossible to distinguish between 6/4/2016 and 5/6/2016, if we look at them alone. However, I was hoping that the fact the dates are in ascending order and the actual dataset is spread for over a year, there would be a method to make sense of it. – Amit Jun 01 '16 at 16:10
  • @Amit - How many dates are in one month? How many are broken? 5%, 50%? – jezrael Jun 01 '16 at 16:19
  • @Amit - Can you add sample for two months? – jezrael Jun 01 '16 at 16:21
  • @jezrael : sample code added. For april month, the dominant format is dd/mm/yyyy but for may it is mm/dd/yyyy. – Amit Jun 01 '16 at 17:42

1 Answers1

1

The real problem is that there are ambiguous dates in your dataset (do you parse it as mm/dd/yyyy or dd/mm/yyyy if it could be either?? (I've been here, and we decided just to pick what the majority seemed to be; essentially the dataset was compromised... and we had to treat it as such).


If it's a Series then hitting it with pd.to_datetime seems to work:

In [11]: s = pd.Series(['6/5/2016', '7/5/2016', '7/5/2016', '7/5/2016', '9/5/2016', '9/5/2016', '9/5/2016', '9/5/2016', '5/13/2016', '5/14/2016', '5/14/2016'])

In [12]: pd.to_datetime(s)
Out[12]:
0    2016-06-05
1    2016-07-05
2    2016-07-05
3    2016-07-05
4    2016-09-05
5    2016-09-05
6    2016-09-05
7    2016-09-05
8    2016-05-13
9    2016-05-14
10   2016-05-14
Name: 0, dtype: datetime64[ns]

Note: If you had a consistent format you can pass it in explicitly:

In [13]: pd.to_datetime(s, format="%m/%d/%Y")
Out[13]:
0    2016-06-05
1    2016-07-05
2    2016-07-05
3    2016-07-05
4    2016-09-05
5    2016-09-05
6    2016-09-05
7    2016-09-05
8    2016-05-13
9    2016-05-14
10   2016-05-14
Name: 0, dtype: datetime64[ns]
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Thanks for replying ! The dates that I have posted are actually for month of may. But pd.to_datetime() understands 7/5/2014 as 5th July 2016. I understand that it is impossible to distinguish between 6/4/2016 and 5/6/2016, if we look at them alone. However, I was hoping that the fact the dates are in ascending order and the actual dataset is spread for over a year, there would be a method to make sense of it. Are you aware of any function that could use this fact and make sense of the dates ? – Amit Jun 01 '16 at 16:12
  • You can use `dayfirst=True` (but that won't solve for ambiguity). Let me have a little think about being monotonic. Edit: It's rough cos brute forcing (to find a monotonic solution) is potentially 2^N (or more likely 2 ^ (N/3) as most dates are non-ambiguous). – Andy Hayden Jun 01 '16 at 21:02