I have some very sparse, multidimensional time series, with activity at certain times and zero activity at other times. I'm representing the data in Pandas as a SparseDataframe
with a MultiIndex
. The work flow is to perform calculations on the small set of data that are non-zero, then put the results into the large sparse dataframe. Later I will do calculations on that sparse dataframe (namely tracking the change in activity over time, including the zero activity areas).
The problem is in putting the small set of data into the sparse dataframe. Below is a much smaller dataset than what I will eventually use.
With a regular index it's ok:
import pandas as pd
full_index = arange(10000000) #10 million
sparse_data = full_index[:20000]
sparse_df = pd.SparseDataFrame(index=full_index)
data_df = pd.DataFrame(index=sparse_data,columns=['Data'], data=1) #Calculations would go here.
%time sparse_df.join(data_df)
CPU times: user 888 ms, sys: 225 ms, total: 1.11 s
Wall time: 1.56 s
With a MultiIndex
it's much slower!
full_index = pd.MultiIndex.from_product([arange(2),
arange(5),
arange(1000000)]) #Still 10 million
sparse_data = full_index[:20000]
sparse_df = pd.SparseDataFrame(index=full_index)
data_df = pd.DataFrame(index=sparse_data,columns=['Data'], data=1)
%time sparse_df.join(data_df)
CPU times: user 13.8 s, sys: 1.82 s, total: 15.6 s
Wall time: 16.6 s #10x slower!
I thought the issue might be in using join
, so I tried another route. It was faster but did not solve the problem.
#regular Index
%time sparse_df['Data'] = data_df.reindex(full_index).to_sparse()['Data'] #The reindexing is the slow part
CPU times: user 318 ms, sys: 129 ms, total: 448 ms
Wall time: 448 ms
#MultiIndex
%time sparse_df['Data'] = data_df.reindex(full_index).to_sparse()['Data'] #The reindexing is the slow part
CPU times: user 9.33 s, sys: 1.02 s, total: 10.3 s
Wall time: 10.4 s #20x slower!
It is indeed faster to use reindex
instead of join
, but now MultiIndex
is even slower, comparatively!
What are my options? Is there a way to achieve the speed of a regular index with the MultiIndex
? Is there an elegant way to make a regular index function like the MultiIndex
? Should I be rethinking how to do all of this?