
I have a dataframe with one column, called maturity_dates. I also have a DatetimeIndex, called simulationdates.

I want to create a dataframe where each cell is the maturity date minus the simulationdates value, floored at zero and expressed in years.

The code below does the job, but it is very slow on large dataframes. Is there a quicker way without the for loop?

import numpy as np
import pandas as pd
import time

maturity_dates_raw = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021]*40,
                   'month': [2, 3, 3, 3, 3, 3, 3]*40,
                   'day': [4, 5, 5, 5, 5, 5, 5]*40})
maturity_dates = pd.to_datetime(maturity_dates_raw)
date = pd.to_datetime("4th of July, 2015")
simulationdates = date + pd.to_timedelta(np.arange(5000), 'D')


t0 = time.time()
trade_m = pd.DataFrame(index=maturity_dates.index, columns=simulationdates)
mat_date = pd.to_datetime(maturity_dates)
dates = pd.DatetimeIndex.to_series(simulationdates)
for i in range(trade_m.shape[1]):
    trade_m.iloc[:, i] = np.maximum(
        (mat_date - dates[i]).astype('timedelta64[D]') / 365.0, 0.0)
t1 = time.time()
print('Time to maturity done in {} seconds.'.format(np.round(t1 - t0, 4)))
print(trade_m)

Time to maturity done in 0.018 seconds.
   2015-07-04  2015-07-05  2015-07-06  2015-07-07  2015-07-08
0    0.000000    0.000000    0.000000    0.000000    0.000000
1    0.671233    0.668493    0.665753    0.663014    0.660274
2    1.671233    1.668493    1.665753    1.663014    1.660274
3    2.671233    2.668493    2.665753    2.663014    2.660274
4    3.671233    3.668493    3.665753    3.663014    3.660274
5    4.673973    4.671233    4.668493    4.665753    4.663014
6    5.673973    5.671233    5.668493    5.665753    5.663014

muzzex

2 Answers

# Setup.
maturity_dates_raw = pd.DataFrame(
    {'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021],
     'month': [2, 3, 3, 3, 3, 3, 3],
     'day': [4, 5, 5, 5, 5, 5, 5]}
)
n = 40  # Adjusts size of data (i.e. rows = n * 7).
simulation_date_count = 5000  # Adjusts number of simulation dates (i.e. columns).
maturity_dates = pd.to_datetime(pd.concat([maturity_dates_raw] * n, ignore_index=True))
date = pd.to_datetime("4th of July, 2015")
simulationdates = date + pd.to_timedelta(np.arange(simulation_date_count), 'D')

# OP Result.
trade_m = pd.DataFrame(index=maturity_dates.index, columns=simulationdates)
mat_date = pd.to_datetime(maturity_dates)
dates = pd.DatetimeIndex.to_series(simulationdates)
for i in range(trade_m.shape[1]):
    trade_m.iloc[:, i] = np.maximum(
        (mat_date - dates[i]).astype('timedelta64[D]') / 365.0, 0.0)
result_op = trade_m

We can use a dictionary comprehension to calculate the difference between the maturity dates and the simulation dates.

# Method 1.
result_1 = pd.DataFrame(
    {sim_date: [(maturity - sim_date).days / 365 
                for maturity in maturity_dates] 
     for sim_date in simulationdates}
).clip(lower=0)
>>> result_1
   2015-07-04  2015-07-05  2015-07-06  2015-07-07  2015-07-08
0    0.000000    0.000000    0.000000    0.000000    0.000000
1    0.671233    0.668493    0.665753    0.663014    0.660274
2    1.671233    1.668493    1.665753    1.663014    1.660274
3    2.671233    2.668493    2.665753    2.663014    2.660274
4    3.671233    3.668493    3.665753    3.663014    3.660274
5    4.673973    4.671233    4.668493    4.665753    4.663014
6    5.673973    5.671233    5.668493    5.665753    5.663014

We can also calculate years to maturity only for the first simulation date, then subtract the cumulative number of days elapsed at each later simulation date divided by 365. This means all the other columns are computed with plain float arithmetic, which gives significant performance benefits.

# Method 2.
day_deltas = np.concatenate(
    ([0], np.array((simulationdates[1:] - simulationdates[:-1]).days).cumsum())) / 365
years_to_maturity = (maturity_dates - simulationdates[0]).dt.days / 365
result_2 = pd.DataFrame(
    {sim_date: years_to_maturity - day_delta 
     for sim_date, day_delta in zip(simulationdates, day_deltas)}
).clip(lower=0)
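Going one step further (a sketch beyond the two methods above; `result_bc` is a name introduced here), the whole table can also be produced with a single NumPy broadcast over integer day counts, with no Python-level loop over the columns at all:

```python
import numpy as np
import pandas as pd

# Same inputs as in the question.
maturity_dates_raw = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021] * 40,
                                   'month': [2, 3, 3, 3, 3, 3, 3] * 40,
                                   'day': [4, 5, 5, 5, 5, 5, 5] * 40})
maturity_dates = pd.to_datetime(maturity_dates_raw)
simulationdates = pd.to_datetime('2015-07-04') + pd.to_timedelta(np.arange(5000), 'D')

# Convert both sides to integer day counts since the epoch, then take the
# outer difference with broadcasting: (rows, 1) - (1, cols) -> (rows, cols).
mat = maturity_dates.values.astype('datetime64[D]').astype(np.int64)
sim = simulationdates.values.astype('datetime64[D]').astype(np.int64)
years = np.maximum(mat[:, None] - sim[None, :], 0) / 365.0

result_bc = pd.DataFrame(years, index=maturity_dates.index, columns=simulationdates)
```

This produces the same values as Method 2, since both reduce the problem to integer day arithmetic.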

Timings and equality checks

# OP Method: 1min 2s ± 2.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Method 1: 27.7 s ± 2.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Method 2: 852 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> result_op.shape, result_1.shape, result_2.shape
((280, 5000), (280, 5000), (280, 5000))

>>> result_op.sub(result_1).abs().sum().sum()
0.0

>>> result_1.sub(result_2).abs().sum().sum()
5.7462090641280383e-11

>>> ((result_op.index == result_1.index) & (result_1.index == result_2.index)).all()
True

>>> ((result_op.columns == result_1.columns) & (result_1.columns == result_2.columns)).all()
True
Alexander
  • Thanks, I have made the data sets larger. On my computer, it saves some time, going down from 70 seconds to 50 seconds. Is there any even faster way? – muzzex Feb 21 '20 at 21:30
  • I get error message: Traceback (most recent call last): File "", line 2, in File "C:\Users\Ola\Anaconda3\envs\leamconsulting\lib\site-packages\pandas\core\frame.py", line 407, in __init__ raise TypeError("data argument can't be an iterator") TypeError: data argument can't be an iterator – muzzex Feb 21 '20 at 21:53
  • @muzzex Which line causes the error and what version of pandas are you using? – Alexander Feb 21 '20 at 21:56
  • Traceback (most recent call last): File "", line 1, in AttributeError: 'Int64Index' object has no attribute 'to_numpy' '0.23.4' pandas version – muzzex Feb 21 '20 at 21:58
  • An old version of pandas I see (pre 0.24.0)... Try `day_deltas = np.array((simulationdates[1:] - simulationdates[:-1]).days).cumsum()` instead. – Alexander Feb 21 '20 at 22:01
  • I updated my pandas, but the output doesn't match the correct dimension. Not after editing my first post. – muzzex Feb 21 '20 at 22:30
  • Thanks, especially Method 2 is fast. Down from around 80 sec in my code to less than 2 sec! – muzzex Feb 22 '20 at 13:47
  • Depending on how many overlapping maturity dates you have in your actual data, you may want to first get the unique values, calculate the years, merge back to the first column, and then apply the rest of my method for the remaining columns. – Alexander Feb 22 '20 at 18:14
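The unique-values idea from the last comment can be sketched as follows (an illustrative sketch, not code from the original answer; `small` and `result` are names introduced here):

```python
import numpy as np
import pandas as pd

# Same inputs as in the question.
maturity_dates_raw = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020, 2021] * 40,
                                   'month': [2, 3, 3, 3, 3, 3, 3] * 40,
                                   'day': [4, 5, 5, 5, 5, 5, 5] * 40})
maturity_dates = pd.to_datetime(maturity_dates_raw)
simulationdates = pd.to_datetime('2015-07-04') + pd.to_timedelta(np.arange(5000), 'D')

# Build the years-to-maturity table only for the unique maturities...
unique_mats = maturity_dates.drop_duplicates()
day_deltas = np.concatenate(
    ([0], np.array((simulationdates[1:] - simulationdates[:-1]).days).cumsum())) / 365
years_unique = (unique_mats - simulationdates[0]).dt.days / 365
small = pd.DataFrame(
    {sim_date: years_unique - dd for sim_date, dd in zip(simulationdates, day_deltas)}
).clip(lower=0)
small.index = unique_mats.values

# ...then expand back to one row per (possibly duplicated) maturity date.
result = small.loc[maturity_dates.values]
result.index = maturity_dates.index
```

With only 7 unique maturities out of 280 rows, the expensive per-row work shrinks by a factor of 40 before the cheap `.loc` expansion.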

Another approach, using itertools.product (a Cartesian product):

import numpy as np
import pandas as pd
import itertools

df = pd.DataFrame(itertools.product(simulationdates, maturity_dates),
                  columns=['simulationdates', 'maturity_dates'])

df = (df.assign(dif=np.maximum((df.maturity_dates - df.simulationdates).dt.days / 365, 0))
        .pivot_table(index='maturity_dates', columns='simulationdates', values='dif'))

df = (pd.merge(maturity_dates.to_frame('maturity_dates'), df,
               left_on='maturity_dates', right_index=True)
        .sort_index()
        .drop(columns='maturity_dates'))

Note the final merge: it is necessary because maturity_dates is not unique.

ipj
  • raise TypeError("data argument can't be an iterator") TypeError: data argument can't be an iterator – muzzex Feb 21 '20 at 21:56
  • Strange, I edited in the full code, which works for me after restarting the kernel to ensure a clean environment. Can you check again? You probably have an older version of pandas; maybe just update? – ipj Feb 21 '20 at 22:01
  • I updated pandas and numpy, and it's working now. I updated my data set, by adding "*40" to the maturity dates. The output dimension doesn't match then. – muzzex Feb 21 '20 at 22:18
  • What do you mean by correct output dimension? If it refers to the index, just add `.reset_index()` at the end. – ipj Feb 21 '20 at 22:42
  • 1
    The pivot table needs to be joined back to the maturity dates to get the correct dimensions because there are multiple maturity dates with the same values. – Alexander Feb 22 '20 at 03:44
  • Thanks @Alexander, I haven't noticed code change. Now I added final join to not unique `maturity_dates`. – ipj Feb 22 '20 at 09:43
  • Thank you! This answer is also very fast, however slightly slower than solution 2 above. 2.5 sec instead of 1.7 sec on my computer. – muzzex Feb 22 '20 at 13:48
  • There are only seven unique maturity dates, so my guess is that this would be much slower if one had many unique maturities. – Alexander Feb 22 '20 at 18:12
  • However date difference is calculated now for all dates not only for distinct dates. This suggests possibility to optimize by limiting calculations only for unique dates as mentioned in your solution. – ipj Feb 22 '20 at 19:11