
I'm getting quite unexpected behaviour from pandas' pd.to_datetime. My pd.Series is as follows:

0         2017-01-06 14:37:16
1         2017-01-27 00:00:00
2         2017-01-18 00:00:00
3         2017-01-26 00:00:00
4                        None
                 ...         
454823    2019-10-22 11:20:03
454824                   None
454825    2019-07-11 00:00:00
454826                   None
454827    2019-07-15 00:00:00
Name: colx, Length: 454828, dtype: object

And when casting to datetime I'm getting:

pd.to_datetime(df.colx, errors='coerce')

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

While this seems to indicate that there are duplicate values in the index, I ran the following tests to check if that was the cause:

all(df.colx.index == range(df.colx.shape[0]))
# True

df.index.duplicated().any()
# False

So apparently there aren't any duplicate indexes. What could be causing this error?

yatu
  • Can you slice this dataframe into smaller parts that still generate the error? What is the smallest you can get and still generate it? – Scott Boston Oct 28 '19 at 18:15
  • @yatu, which version of `pandas` are you running? If < `0.25.0`, try specifying `cache=False` in `pd.to_datetime`. I think you have an issue with both `pd.NaT` and `None` being null values in that column, which was fixed with 0.25. – ALollz Oct 28 '19 at 19:07
  • Thanks for the suggestions @ALollz, I'll try tomorrow. – yatu Oct 28 '19 at 21:24
  • You're completely right @ALollz, that just did it. My series has many `None`s and also `pd.NaT`s, and filling with `pd.NaT` solved it. My pandas version is 0.25.2 though, so apparently this is still causing problems. – yatu Oct 29 '19 at 07:47
  • Feel free to add this as an answer; otherwise I will. It will be useful to have one here. I'm surprised I found nothing related. @ALollz – yatu Oct 29 '19 at 07:49
  • This appears to have regressed in pandas v1.2.5 released today (June 22nd, 2021). – Simon Ejsing Jun 23 '21 at 02:28

3 Answers


The error you are receiving isn't related to your Index. It's related to this issue, which should have been fixed as of 0.25.0. The issue was the following inconsistent handling of multiple null types when using pd.to_datetime:

import pandas as pd

pd.core.algorithms.unique([pd.NaT, None])
#array([NaT, None], dtype=object) # That is, `pd.NaT` and `None` are unique

pd.Index([pd.NaT, None]).is_unique
#False   # However, Index.is_unique considers them duplicates

The handling of these null types should have been fixed with 0.25.0. If it's still giving you a problem, the solution is to ensure you have a single representation of null values in the column you need to convert. In this case you can .fillna with the datetime null value:

pd.to_datetime(df.colx.fillna(pd.NaT), errors='coerce')

pd.__version__
#'0.25.0'

pd.to_datetime([pd.NaT, None])
#DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
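
For a self-contained check, here is a minimal sketch (the Series contents are invented for illustration) that reproduces the mixed-null setup and applies the fillna fix:

import pandas as pd

# Invented example: an object Series mixing datetime strings with
# both kinds of null value, as in the question
s = pd.Series(['2017-01-06 14:37:16', None, '2019-07-15 00:00:00', pd.NaT],
              dtype=object, name='colx')

# On affected versions, pd.to_datetime(s, errors='coerce') raises
# InvalidIndexError: the conversion cache builds an Index over the unique
# values, and pd.NaT/None are deduplicated inconsistently there.

# Normalizing to a single null representation first sidesteps the bug
print(pd.to_datetime(s.fillna(pd.NaT), errors='coerce'))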
ALollz
  • I had the same problem. Upgrading to the current latest Pandas version (1.1.2) did not help, but the work-around you suggested here did solve the problem for me. I guess the bug is not actually fixed or there was a regression. – Steve Jorgensen Sep 18 '20 at 03:50
  • It still seems to be an open issue. I just had the problem on Pandas version 1.2.0. Replacing with pd.NaT worked, though. Thanks for the solution here. – CheradenineZK Jan 19 '21 at 16:12

Start by clearing up a misunderstanding

I noticed a flaw in your check procedure:

Calling df.index.duplicated().any() checks only that the index has no duplicates; it says nothing about the values in colx.

To investigate the issue, I created my input file from your data (just 10 data rows):

colx
2017-01-06 14:37:16
2017-01-27 00:00:00
2017-01-18 00:00:00
2017-01-26 00:00:00
None
2019-10-22 11:20:03
None
2019-07-11 00:00:00
None
2019-07-15 00:00:00

I read it with read_csv, called df.duplicated().any(), and the result was True, so there are duplicates in the colx column.

Run df.duplicated() and you will see that True is printed for rows with index 6 and 8 (the second and third occurrences of the 'None' string).
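
If you want to reproduce this without a file, here is a small sketch (the DataFrame is built inline; the 'None' entries are literal strings, as they would be after read_csv on the pandas versions this answer targets):

import pandas as pd

df = pd.DataFrame({'colx': ['2017-01-06 14:37:16', '2017-01-27 00:00:00',
                            '2017-01-18 00:00:00', '2017-01-26 00:00:00',
                            'None', '2019-10-22 11:20:03', 'None',
                            '2019-07-11 00:00:00', 'None',
                            '2019-07-15 00:00:00']})

print(df.duplicated().any())  # True
print(df[df.duplicated()])    # rows 6 and 8: repeats of the 'None' string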

Another check: Run df.info() and you will get:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
colx    10 non-null object
dtypes: object(1)
memory usage: 208.0+ bytes

This confirms that no element holds a "true" None value. There are only strings containing "None".

Another remark: all(df.colx.index == range(df.colx.shape[0])) checks only that the index contains consecutive numbers, which says nothing about the content of colx.

How you read your DataFrame

I suppose you read your DataFrame with e.g. read_csv, without any conversion, so the colx column is of object (actually string) type.

In that case an attempt to call pd.to_datetime fails on the first element containing None (a string), because it cannot be converted to a datetime.
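
A quick illustration of that failure mode (the exact error message varies with the pandas version):

import pandas as pd

# 'None' here is a literal string, not a null value, so parsing fails
pd.to_datetime(pd.Series(['2017-01-06 14:37:16', 'None']))
# ValueError: the string 'None' cannot be parsed as a datetime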

What to do

Try the following approach:

  1. When reading the DataFrame, pass the na_values=['None'] parameter. This ensures that elements containing None are not left as strings but are converted to NaN.

  2. Print the DataFrame (read from my limited source). Instead of None (a string) there will be NaN, a special float value.

  3. Run df.info(). This time the printout will be:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10 entries, 0 to 9
    Data columns (total 1 columns):
    colx    7 non-null object
    dtypes: object(1)
    memory usage: 208.0+ bytes
    

    Note that there are only 7 non-null values out of the total of 10, so the 3 remaining are "true" None values, which Pandas prints as NaN.

  4. Run pd.to_datetime(df.colx). This time there should be no error (see the sketch below).
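
A minimal end-to-end sketch of these steps (StringIO stands in for your real file, and the sample is shortened to three rows):

import pandas as pd
from io import StringIO

csv = 'colx\n2017-01-06 14:37:16\nNone\n2019-07-15 00:00:00\n'

# Step 1: treat the literal string 'None' as a missing value on read
df = pd.read_csv(StringIO(csv), na_values=['None'])

df.info()                       # steps 2-3: only 2 of 3 values are non-null
print(pd.to_datetime(df.colx))  # step 4: converts without error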

Valdi_Bo

I had the same error; the following worked for me:

pd.to_datetime(df.colx.astype(str))
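
A guess at why this can help: casting with astype(str) gives every element the same type, so the pd.NaT/None uniqueness inconsistency described in the accepted answer never comes into play. You will likely still want errors='coerce' so that string nulls become NaT instead of raising:

# Assumption: after astype(str), nulls survive as the strings 'None'/'NaT';
# errors='coerce' turns anything unparseable into NaT rather than raising
pd.to_datetime(df.colx.astype(str), errors='coerce')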
  • I doubt this solves the problem of `None` or `NaN` values. Have you tested it with data from OP? – craigb Oct 30 '22 at 20:56