So I have some observed time-series data with daily values, but there is a problem with it. The problem is that in some of the data, days don't exist (e.g. 1980-02-30 in the Year-Month-Day format). This is causing errors when I use the pandas.to_datetime function, because it won't parse an imaginary date. I have over 300 csvs to parse, and not all of them have imaginary dates, just some of them.
The solution that I can think of would be to just brute force every day and compare it to a real day in that year. I would like to know if there is a more elegant solution than this...
Below I have included a sample case that I created.
import numpy as np
import pandas as pd
days = [str(x) for x in range(1, 33)]
months = [str(x) for x in range(1, 13)]
years = [str(x) for x in range(1980, 1983)]
print(months)
date_list = []
for year in years:
for month in months:
for day in days:
if len(day) == 1:
day = '0{}'.format(day)
if len(month) == 1:
month = '0{}'.format(month)
date_list.append('{}-{}-{}'.format(year, month, day))
data = np.random.rand(1152, 2)
df = pd.DataFrame(data, index=date_list, columns=['Simulated', 'Observed'])
df.index = pd.to_datetime(df.index, infer_datetime_format=True)
The last line is where I found an error, I was wondering if there are tools that are already built to deal with this kind of thing.