1

I have a time series of covariance matrices stored as a MultiIndex (let's called it "A") and a time series of scalars stored as a DataFrame (let's call it "b").

"A" is of shape "k" x "n" x "n" and "b" is of shape "k" x "n", so I have a MultiIndex of "k" dates of covariance matrices with shape "n" x "n" that I would like to multiply by a "k" x "n" x 1 DataFrame of scalars to obtain a "k" x "n" x 1 DataFrame.

When I try something like A.multiply(b), it does not work because the MultiIndex dimensions are (kxn, n) and do not strictly match the DataFrame dimensions of (k, n).

I am able to perform the calculation using list comprehension and converting the MultiIndex and DataFrame to numpy arrays, but this takes a very, very long time, so I must be doing it very inefficiently.

An example of this is:

[np.dot( np.array( A.loc[timestamp,:] ), np.array( b.loc[timestamp,:] ) ) for timestamp in b.index]

This takes an incredibly long time to compute. Is there a fast way to perform this calculation?

The User
  • 55
  • 3
  • 11
  • Which `n` should `b`’s `n`-dimension align with? The one in the multi-index or the one as columns? – Cimbali Aug 18 '21 at 09:55
  • 'n' is the number of entities I am looking at, so 'A' is the covariance of 'k' dates with each date having 'n' rows and 'n' columns for the 'n' entities in the covariance matrix. 'b' has 'k' dates and 'n' columns for each date. – The User Aug 18 '21 at 19:02
  • Yeah @TheUser I was trying to figure out whether you multiply the rows or the columns. But in the end I reproduced your code’s results so it should be fine. – Cimbali Aug 18 '21 at 19:04

1 Answers1

2

Here’s random-valued dataframes of size A 5×3×3 and and b 5×3:

>>> A
                              0         1         2
2021-08-14 11:14:00 0  0.309559  0.286455  0.080467
                    1  0.996128  0.666138  0.354322
                    2  0.113316  0.208389  0.197855
2021-08-14 11:15:00 0  0.299459  0.650561  0.481504
                    1  0.570308  0.298572  0.677013
                    2  0.476066  0.610945  0.750575
2021-08-14 11:16:00 0  0.861801  0.692752  0.046450
                    1  0.587891  0.389874  0.779039
                    2  0.009947  0.647356  0.735746
2021-08-14 11:17:00 0  0.990027  0.185747  0.286276
                    1  0.831238  0.474372  0.459076
                    2  0.885953  0.768626  0.866064
2021-08-14 11:18:00 0  0.952294  0.106072  0.477348
                    1  0.370116  0.646081  0.873394
                    2  0.439066  0.568404  0.227528
>>> b
                            0         1         2
2021-08-14 11:14:00  0.113316  0.208389  0.197855
2021-08-14 11:15:00  0.476066  0.610945  0.750575
2021-08-14 11:16:00  0.009947  0.647356  0.735746
2021-08-14 11:17:00  0.885953  0.768626  0.866064
2021-08-14 11:18:00  0.439066  0.568404  0.227528

If you want to multiply each row in A by a single value of b, i.e. the b columns should align to the second level of A indexes, you should use stack:

>>> A.mul(b.stack(), axis='index')
                              0         1         2
2021-08-14 11:14:00 0  0.035078  0.032460  0.009118
                    1  0.207582  0.138816  0.073837
                    2  0.022420  0.041231  0.039147
2021-08-14 11:15:00 0  0.142562  0.309710  0.229227
                    1  0.348427  0.182411  0.413618
                    2  0.357323  0.458560  0.563363
2021-08-14 11:16:00 0  0.008572  0.006891  0.000462
                    1  0.380575  0.252387  0.504315
                    2  0.007318  0.476290  0.541322
2021-08-14 11:17:00 0  0.877118  0.164563  0.253627
                    1  0.638911  0.364615  0.352858
                    2  0.767292  0.665679  0.750067
2021-08-14 11:18:00 0  0.418120  0.046573  0.209587
                    1  0.210376  0.367235  0.496441
                    2  0.099900  0.129328  0.051769

On the other hand, if you want the b columns to align with the A columns, you could use pd.DataFrame.align(), which returns aligned versions of both dataframes. Here A will be unchanged as confirmed with .compare(), and b_aligned will have its rows repeated per second-level index to match A’s indexing:

>>> b_aligned, A_aligned = b.align(A, level=0)
>>> A_aligned.compare(A)
Empty DataFrame
Columns: []
Index: []
>>> b_aligned
                              0         1         2
2021-08-14 11:14:00 0  0.113316  0.208389  0.197855
                    1  0.113316  0.208389  0.197855
                    2  0.113316  0.208389  0.197855
2021-08-14 11:15:00 0  0.476066  0.610945  0.750575
                    1  0.476066  0.610945  0.750575
                    2  0.476066  0.610945  0.750575
2021-08-14 11:16:00 0  0.009947  0.647356  0.735746
                    1  0.009947  0.647356  0.735746
                    2  0.009947  0.647356  0.735746
2021-08-14 11:17:00 0  0.885953  0.768626  0.866064
                    1  0.885953  0.768626  0.866064
                    2  0.885953  0.768626  0.866064
2021-08-14 11:18:00 0  0.439066  0.568404  0.227528
                    1  0.439066  0.568404  0.227528
                    2  0.439066  0.568404  0.227528
>>> A_aligned.mul(b_aligned)
                              0         1         2
2021-08-14 11:14:00 0  0.035078  0.059694  0.015921
                    1  0.112877  0.138816  0.070104
                    2  0.012840  0.043426  0.039147
2021-08-14 11:15:00 0  0.142562  0.397457  0.361405
                    1  0.271504  0.182411  0.508149
                    2  0.226639  0.373254  0.563363
2021-08-14 11:16:00 0  0.008572  0.448457  0.034176
                    1  0.005848  0.252387  0.573175
                    2  0.000099  0.419070  0.541322
2021-08-14 11:17:00 0  0.877118  0.142770  0.247933
                    1  0.736438  0.364615  0.397589
                    2  0.784913  0.590785  0.750067
2021-08-14 11:18:00 0  0.418120  0.060292  0.108610
                    1  0.162505  0.367235  0.198722
                    2  0.192779  0.323083  0.051769

I’m only guessing to what you’re really doing as you’re not specifying the axis etc., but this gives the same results as your code:

>>> A.mul(b.align(A, level=0)[0]).sum(axis='columns').unstack(1)
                            0         1         2
2021-08-14 11:14:00  0.110693  0.321797  0.095413
2021-08-14 11:15:00  0.901424  0.962065  1.163256
2021-08-14 11:16:00  0.491205  0.831409  0.960491
2021-08-14 11:17:00  1.267821  1.498642  2.125765
2021-08-14 11:18:00  0.587022  0.728462  0.567631
Cimbali
  • 11,012
  • 1
  • 39
  • 68
  • Thank you @Cimbali, this seems to work and has shown an improvement in speed. I appreciate your advice. It still seems to take a very long time given the amount of data I am processing (ca 4.5m timestamps). Are there any other tips for improving the speed of this operation? – The User Aug 21 '21 at 19:56