1

I would like to understand if I can convert a column with mixed types (DateTime and Strings) to a PeriodIndex (for instance month).

I have the following DataFrame:

booking_date          ...      credit           debit
None                  ...      10185.00     -10185.00
2017-01-01 00:00:00   ...       1796.00          0.00
2018-07-01 00:00:00   ...       7423.20        -11.54
2017-04-01 00:00:00   ...       1704.00          0.00
2017-12-01 00:00:00   ...       1938.60      -1938.60
2018-12-01 00:00:00   ...       1403.47       -102.01
2018-01-01 00:00:00   ...       2028.00        -76.38
2019-01-01 00:00:00   ...        800.00       -256.98
Total                 ...      10185.00     -10185.00

I'm trying to apply the PeriodIndex to booking_date:

df['booking_date'] = pd.PeriodIndex(df['booking_date'].values, freq='M')

However, I receive the following error:

pandas._libs.tslibs.parsing.DateParseError: Unknown datetime string format, unable to parse: TOTAL

Anyway I can get around with this?

Thanks!

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
jcunhafonte
  • 429
  • 7
  • 17
  • It's probably happen because you have a String value('TOTAL') in this column or you need to cast the column from string to datetime using: ```pd.to_datetime()```. you will have to drop all the non valid dates and convert them to None or Null – Eitan Rosati Jul 30 '21 at 07:44
  • Thanks, @EitanRosati. I would like to maintain the exact same structure just mutating the booking_date column period. – jcunhafonte Jul 30 '21 at 08:54

2 Answers2

1

If need Periods only cannot mixing with strings:

df['booking_date'] = pd.to_datetime(df['booking_date'], errors='coerce').dt.to_period('m')
print (df)
  booking_date  ...    credit     debit
0          NaT  ...  10185.00 -10185.00
1      2017-01  ...   1796.00      0.00
2      2018-07  ...   7423.20    -11.54
3      2017-04  ...   1704.00      0.00
4      2017-12  ...   1938.60  -1938.60
5      2018-12  ...   1403.47   -102.01
6      2018-01  ...   2028.00    -76.38
7      2019-01  ...    800.00   -256.98
8          NaT  ...  10185.00 -10185.00

But it is possible:

orig = df['booking_date']

df['booking_date'] = pd.to_datetime(df['booking_date'], errors='coerce').dt.to_period('m')

df.loc[df['booking_date'].isna(), 'booking_date'] = orig
print (df)
  booking_date  ...    credit     debit
0         None  ...  10185.00 -10185.00
1      2017-01  ...   1796.00      0.00
2      2018-07  ...   7423.20    -11.54
3      2017-04  ...   1704.00      0.00
4      2017-12  ...   1938.60  -1938.60
5      2018-12  ...   1403.47   -102.01
6      2018-01  ...   2028.00    -76.38
7      2019-01  ...    800.00   -256.98
8        Total  ...  10185.00 -10185.00

print (df['booking_date'].apply(type))
0                             <class 'NoneType'>
1    <class 'pandas._libs.tslibs.period.Period'>
2    <class 'pandas._libs.tslibs.period.Period'>
3    <class 'pandas._libs.tslibs.period.Period'>
4    <class 'pandas._libs.tslibs.period.Period'>
5    <class 'pandas._libs.tslibs.period.Period'>
6    <class 'pandas._libs.tslibs.period.Period'>
7    <class 'pandas._libs.tslibs.period.Period'>
8                                  <class 'str'>
Name: booking_date, dtype: object

new = pd.to_datetime(df['booking_date'], errors='coerce').dt.to_period('m')

df['booking_date'] = np.where(new.isna(), df['booking_date'], new)
print (df)
  booking_date  ...    credit     debit
0         None  ...  10185.00 -10185.00
1      2017-01  ...   1796.00      0.00
2      2018-07  ...   7423.20    -11.54
3      2017-04  ...   1704.00      0.00
4      2017-12  ...   1938.60  -1938.60
5      2018-12  ...   1403.47   -102.01
6      2018-01  ...   2028.00    -76.38
7      2019-01  ...    800.00   -256.98
8        Total  ...  10185.00 -10185.00

print (df['booking_date'].apply(type))
0                             <class 'NoneType'>
1    <class 'pandas._libs.tslibs.period.Period'>
2    <class 'pandas._libs.tslibs.period.Period'>
3    <class 'pandas._libs.tslibs.period.Period'>
4    <class 'pandas._libs.tslibs.period.Period'>
5    <class 'pandas._libs.tslibs.period.Period'>
6    <class 'pandas._libs.tslibs.period.Period'>
7    <class 'pandas._libs.tslibs.period.Period'>
8                                  <class 'str'>
Name: booking_date, dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

In this case, you probably want to filter out the Total row (and maybe the None as well, depending on what that could be). The total can, probably (I dont know the exact data obviously), be derived from adding all the credit / debit values up, and you can always do that again, so you lose no information if you filter Total. To keep the dimensions clean, you likely wouldn't want it there anyways. To sum it up, use df["credit"].sum()

Filter Total in booking_date like this df = df[df["booking_date"] != "Total"]

More information about filtering: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/

Dustin
  • 483
  • 3
  • 13
  • To filter out None, you can use `df = df[~df["booking_date"].isnull()]` Where the ~ is basically a "not" operator – Dustin Jul 30 '21 at 08:06