2

I'm trying to read a file in with dates in the (UK) format 13/01/1800, however some of the dates are before 1667, which cannot be represented by the nanosecond timestamp (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas-timestamp-limits). I understand from that page I need to create my own PeriodIndex to cover the range I need (see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-oob) but I can't understand how I convert the string in the csv reader to a date in this periodindex.

So far I have:

span = pd.period_range('1000-01-01', '2100-01-01', freq='D')
df_earliest= pd.read_csv("objects.csv", index_col=0, names=['Object Id', 'Earliest Date'], parse_dates=[1], infer_datetime_format=True, dayfirst=True)

How do I apply the span to the date reader/converter so I can create a PeriodIndex / DateTimeIndex column in the dataframe ?

Richard
  • 109
  • 1
  • 11
  • I think it's best to load and strip the slashes then you can use the method in that link to create a periodindex, so after loading do `df['Earliest Date'] = df['Earliest Date'].str.replace('\\','')` so I'd not pass `parse_dates` param in `read_csv` in order to read those values in as strings – EdChum May 02 '16 at 17:23

1 Answers1

3

you can try to do it this way:

fn = r'D:\temp\.data\36987699.csv'

def dt_parse(s):
    d,m,y = s.split('/')
    return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')


df = pd.read_csv(fn, parse_dates=[0], date_parser=dt_parse)

Input file:

Date,col1
13/01/1800,aaa
25/12/1001,bbb
01/03/1267,ccc

Test:

In [16]: df
Out[16]:
        Date col1
0 1800-01-13  aaa
1 1001-12-25  bbb
2 1267-03-01  ccc

In [17]: df.dtypes
Out[17]:
Date    object
col1    object
dtype: object

In [18]: df['Date'].dt.year
Out[18]:
0    1800
1    1001
2    1267
Name: Date, dtype: int64

PS you may want to add try ... catch block in the dt_parse() function for catching ValueError: exceptions - result of int()...

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • I've wrapped it with a try/except that returns pd.NaT for invalid date strings (otherwise the split crashes out). – Richard May 02 '16 at 18:07