0

I am filtering a dataframe by dates to produce two seperate versions:

  1. Data from only today's date
  2. Data from the last two years

However, when I try to filter on the date, it seems to miss dates that are within the last two years.

date_format = '%m-%d-%Y'  # desired date format

today = dt.now().strftime(date_format)  # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date()  # converting 'today' into a datetime object

today = today.strftime(date_format)
two_years = today - relativedelta(years=2)  # date is today's date minus two years. 
two_years = two_years.strftime(date_format)

# normalizing the format of the date column to the desired format 
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)

df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]

Which results in:

all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']

The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.

Whitewater
  • 297
  • 2
  • 12
  • You're converting the `df_data['date']` column to string, and then trying to use the `>=` operator to compare with `two_years` which is also a string representation of a date. This won't produce the expected results, you need both the column and the `two_years` variable to be datetime types. – Toby Petty Jul 17 '20 at 18:35
  • @TobyPetty: I think string comparison would also work, but that would require %Y%m%d format. And I would not suggest doing it since it isn't *logically* what is intended here. – FObersteiner Jul 17 '20 at 18:37
  • 1
    Yes true if you converted to a suitable string format you could filter on strings correctly; but I would agree it would be better to remove all these unnecessary type conversions and just filter on datetimes. – Toby Petty Jul 17 '20 at 18:44

2 Answers2

0

Your datatype conversions are the problem here. You could do this:

today = dt.now()  # today's date. Will always result in today's date
two_years = today - relativedelta(years=2)  # date is today's date minus two years. 

This prints '2018-07-17 18:40:42.704395'. You can then convert it to the date only format.

two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
Kay
  • 13
  • 4
0

you don't need to convert anything to string, simply work with datetime dtype. Ex:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})

today = pd.Timestamp('now')

print(df[df['date'].dt.date == today.date()])
#         date
# 0 2020-07-17

print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
#         date
# 1 2020-07-15
# 2 2019-08-01

What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • It appears I have an issue where the df['date'] is stored as a str already: -- AttributeError: Can only use .dt accessor with datetimelike values – Whitewater Jul 17 '20 at 19:11
  • @Whitewater: did you try to cast to datetime? I mean like `df['date'] = pd.to_datetime(df['date'])`? – FObersteiner Jul 17 '20 at 19:24
  • Yes, and I get two separate errors `TypeError: Unrecognized value type: ` and `ParserError("Unknown string format: %s", timestr) dateutil.parser._parser.ParserError: Unknown string format: date`. EDIT: I see my problem now, I forgot to add the "df" in front of ['date'] when casting to datetime – Whitewater Jul 17 '20 at 19:26