5

Here's the setup:

I have two (integer-indexed) columns, start and month_delta. start has timestamps (its internal type is np.datetime64[ns]) and month_delta is integers.

I want to quickly produce the column that consists of the each datetime in start, offset by the corresponding number of months in month_delta. How do I do this?

Things I've tried that don't work:

  • apply is too slow.
  • You can't add a series of DateOffset objects to a series of datetime64[ns] dtype (or a DatetimeIndex).
  • You can't use a Series of timedelta64 objects either; Pandas silently converts month-based timedeltas to nanosecond-based timedeltas that are ~30 days long. (Yikes! What happened to not failing silently?)

Currently I'm iterating over all different values of month_delta and doing a tshift by that amount on the relevant part of a DatetimeIndex I created, but this is a horrible kludge:

new_dates = pd.Series(pd.Timestamp.now(), index=start.index)
date_index = pd.DatetimeIndex(start)
for i in xrange(month_delta.max()):
    mask = (month_delta == i)
    cur_dates = pd.Series(index=date_index[mask]).tshift(i, freq='M').index
    new_dates[mask] = cur_dates

Yuck! Any suggestions?

Ben Kuhn
  • 1,059
  • 2
  • 10
  • 25
  • 1
    What happens if you have a date of `2014-10-31` and you add 1 month do it, what date do you get? From your question, it seems like you don't like the 30 days to a month rule. What should the transformation be? – Mike Oct 30 '14 at 20:05
  • how big is your dataset? what would be quick enough? how frequently are you doing this? – acushner Oct 30 '14 at 20:08
  • also, which pandas version? – acushner Oct 30 '14 at 20:09
  • @Mike: it's fine if the solution does the same thing that Pandas DateOffsets do: `pd.Timestamp('2014-10-31') + pd.DateOffset(months=1) == Timestamp('2014-11-30 00:00:00') == pd.Timestamp('2014-10-30') + pd.DateOffset(months=1)`. – Ben Kuhn Oct 30 '14 at 20:17
  • @acushner: I have a script that I (would like to) run fairly frequently that spent about 300 seconds in this computation when using `apply` (more like 100 with the awful hack above). It goes through tens of thousands of rows, tens of times. Pandas 0.14.1. – Ben Kuhn Oct 30 '14 at 20:21
  • thanks. i have an idea for but it's not tested. will post shortly – acushner Oct 30 '14 at 20:48
  • still an issue in pandas 0.17.0 – seanv507 Jan 18 '16 at 11:58

3 Answers3

3

Here is a way to do it (by adding NumPy datetime64s with timedelta64s) without calling apply:

import pandas as pd
import numpy as np
np.random.seed(1)

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

def year(dates):
    "Return an array of the years given an array of datetime64s"
    return dates.astype('M8[Y]').astype('i8') + 1970

def month(dates):
    "Return an array of the months given an array of datetime64s"
    return dates.astype('M8[M]').astype('i8') % 12 + 1

def day(dates):
    "Return an array of the days of the month given an array of datetime64s"
    return (dates - dates.astype('M8[M]')) / np.timedelta64(1, 'D') + 1

N = 10
df = pd.DataFrame({
   'start': pd.date_range('2000-1-25', periods=N, freq='D'),
   'months': np.random.randint(12, size=N)})
start = df['start'].values
df['new_date'] = combine64(year(start), months=month(start) + df['months'], 
                           days=day(start))

print(df)

yields

   months      start   new_date
0       5 2000-01-25 2000-06-25
1      11 2000-01-26 2000-12-26
2       8 2000-01-27 2000-09-27
3       9 2000-01-28 2000-10-28
4      11 2000-01-29 2000-12-29
5       5 2000-01-30 2000-06-30
6       0 2000-01-31 2000-01-31
7       0 2000-02-01 2000-02-01
8       1 2000-02-02 2000-03-02
9       7 2000-02-03 2000-09-03
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
0

i think something like this might work:

df['start'] = pd.to_datetime(df.start)
df.groupby('month_delta').apply(lambda x: x.start + pd.DateOffset(months=x.month_delta.iloc[0]))

there might be a better way to create a series of DateOffset objects and add that some way, but idk...

acushner
  • 9,595
  • 1
  • 34
  • 34
  • When I run this I get `TypeError: cannot use a non-absolute DateOffset in datetime/timedelta operations []`. Anyway, this shaves off a couple lines of code (fewer once you need to reconstruct the original column), but it's no more straightforward... do you expect it to be faster? – Ben Kuhn Oct 30 '14 at 21:04
  • yes, because it should take advantage of vectorized ufuncs – acushner Oct 30 '14 at 21:06
  • if you post a sample of your dataframe i can take a look (i.e. creation code) – acushner Oct 30 '14 at 21:06
  • Here: `df = pd.DataFrame({'start':pd.to_datetime(['2014-01-01 10:00:01', '2014-01-02 10:05:01', '2014-02-01 10:00:01']), 'month_delta':[1,2,1]})` – Ben Kuhn Oct 30 '14 at 21:13
  • thanks, i just took a look and can't get it to work, but i want to figure it out! (for own benefit). yeah, so i'll take a look tomorrow and let you know if i figure anything out. – acushner Oct 30 '14 at 22:42
0

I was not able to find a way without at least using an apply for setup but assuming that is okay:

df = pandas.DataFrame(
    [[datetime.date(2014,10,22), 1], [datetime.date(2014,11,20), 2]], 
    columns=['date','delta'])
>>> df
         date  delta
0  2014-10-22      1
1  2014-11-20      2

from dateutil.relativedelta import relativedelta

df['offset'] = df['delta'].apply(lambda x: relativedelta(months=x))
>>> df['date'] + df['offset']
0    2014-11-22
1    2015-01-20

Note that you must use the datetime from the datetime module rather than the numpy one or the pandas one. Since you are only creating the delta with the apply I would hope you experience a speedup.

Mike
  • 6,813
  • 4
  • 29
  • 50