0

I'm reading an excel file, with Pandas, containing Title and Date columns. When I manually set up a test version like this:

import pandas as pd

df = pd.DataFrame(data={'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
                    'Date': ['1991-11', '1991', '1991', '1991-10-31']})
print(df)

It prints as expected and most importantly, i'm able to sort it exactly how I would like by using print(df.sort_values('Date')), Below is the output i'm ultimately trying to achieve. As you can see there are instances of YYYY/MM/DD, YYYY/MM and just YYYY.

Title        Date
1  Movie2        1991
2  Movie3        1991
3  Movie4  1991-10-31
0  Movie1     1991-11

My issues occur when I try to run print(df.sort_values('Date')) with the actual Excel file i'm reading, using read_excel. I get TypeError: '<' not supported between instances of 'int' and 'str'

I've narrowed it down to how i'm entering YYYY-MM and YYYY-MM-DD dates into the Excel file. If I run it with just YYYY dates, it sorts correctly. In order to correctly display YYYY-MM and YYYY-MM-DD dates, in the Excel file, I must pre-pend a back-tick to them. Perhaps this is what is causing the problem.

Hopefully someone else has run into this before. Is there a way to read those dates with leading back-ticks correctly, using Pandas?

Or, is there a better way to enter dates into the Excel file for use with Pandas? (This may be just as much an Excel question as a Pandas question).

yodish
  • 733
  • 2
  • 10
  • 28
  • Is the `'Date'` column an object dtype (`DataFrame.dtypes`)? – wwii Feb 11 '20 at 22:54
  • Yes, it is. Oddly enough, `df.Date.sort_values` kind of works. It doesn't give an error but it also doesn't sort correctly. – yodish Feb 11 '20 at 23:18
  • 1
    what parameters did you use for the read_excel command? – Matts Feb 11 '20 at 23:22
  • [Any object with a `__lt__` method can be sorted](https://docs.python.org/3/howto/sorting.html#odd-and-ends). try to convert it with [`pd.to_datetime(df['Date'])`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) – wwii Feb 11 '20 at 23:25
  • related: [Converting different date time formats to MM/DD/YYYY format in pandas dataframe](https://stackoverflow.com/questions/45531489/converting-different-date-time-formats-to-mm-dd-yyyy-format-in-pandas-dataframe) ... [python dataframe converting multiple datetime formats](https://stackoverflow.com/questions/31147500/python-dataframe-converting-multiple-datetime-formats) – wwii Feb 11 '20 at 23:31
  • @Matts, so far i'm not using any parameters for read_excel. – yodish Feb 12 '20 at 00:14
  • @wwii, I did try `to_datetime`; but, get an error: `ValueError: No axis named Date for object type `. Thank you for the link, but i'm really trying to avoid converting YYYY dates to YYYY/01/01, etc. – yodish Feb 12 '20 at 00:18

1 Answers1

1

To fix the type error you can convert the ints in the 'Date' column to strings. This will ensure the dates are parsed as strings.

df['Date'] = df['Date'].astype('str')
df['pdDate'] = pd.to_datetime(df['Date'])
df.sort_values('pdDate')

Then create a new column, and convert the dates to datetime and sort the values. The 'Date' column will retain the same formats as in excel, and the dates will be sorted correctly. If you sort them as strings there may be errors, but they could still be sorted.

Matts
  • 1,301
  • 11
  • 30
  • Excellent! This sorts `Date` correctly, using `pdDate` to do it. Very creative, I appreciate your time! – yodish Feb 12 '20 at 02:46