2

So I have some observed time-series data with daily values, but there is a problem with it. The problem is that in some of the data, days don't exist (e.g. 1980-02-30 in the Year-Month-Day format). This is causing errors when I use the pandas.to_datetime function, because it won't parse an imaginary date. I have over 300 csvs to parse, and not all of them have imaginary dates, just some of them.

The solution that I can think of would be to just brute force every day and compare it to a real day in that year. I would like to know if there is a more elegant solution than this...

Below I have included a sample case that I created.

import numpy as np
import pandas as pd

days = [str(x) for x in range(1, 33)]
months = [str(x) for x in range(1, 13)]
years = [str(x) for x in range(1980, 1983)]

print(months)

date_list = []

for year in years:
    for month in months:
        for day in days:
            if len(day) == 1:
                day = '0{}'.format(day)
            if len(month) == 1:
                month = '0{}'.format(month)
            date_list.append('{}-{}-{}'.format(year, month, day))

data = np.random.rand(1152, 2)

df = pd.DataFrame(data, index=date_list, columns=['Simulated', 'Observed'])

df.index = pd.to_datetime(df.index, infer_datetime_format=True)

The last line is where I found an error, I was wondering if there are tools that are already built to deal with this kind of thing.

pythonweb
  • 1,024
  • 2
  • 11
  • 26

0 Answers0