One answer is in terms of DataFrame size. I have a DataFrame with roughly 50 million rows:
df_Usage.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49991484 entries, 0 to 49991483
Data columns (total 7 columns):
BILL_ACCOUNT_NBR int64
MM_ADJ_BILLING_YEARMO int64
BILLING_USAGE_QTY float64
BILLING_DAYS_CNT int64
TARIFF_RATE_TYP object
READ_FROM object
READ_TO object
dtypes: float64(1), int64(3), object(3)
memory usage: 2.6+ GB
Setting the first two columns as the index (converting one of them to a datetime first):
df_Usage['MM_ADJ_BILLING_YEARMO'] = pd.to_datetime(df_Usage['MM_ADJ_BILLING_YEARMO'], format='%Y%m')
df_Usage.set_index(['BILL_ACCOUNT_NBR', 'MM_ADJ_BILLING_YEARMO'], inplace=True)
df_Usage.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 49991484 entries, (5659128163, 2020-09-01 00:00:00) to (7150058108, 2020-01-01 00:00:00)
Data columns (total 5 columns):
BILLING_USAGE_QTY float64
BILLING_DAYS_CNT int64
TARIFF_RATE_TYP object
READ_FROM object
READ_TO object
dtypes: float64(1), int64(1), object(3)
memory usage: 2.1+ GB
That is roughly a 20% reduction in reported memory (2.6 GB down to 2.1 GB). Note the + in "2.6+ GB": info() does not deeply measure the object columns by default, so actual usage is higher; df_Usage.info(memory_usage='deep') reports the exact figure.