I am concerned with creating pandas dataframes with billions of rows. These dataframes are instantiated from a numpy array. The trick is that I need to make some columns into a categorical data type. I would like to do this as fast as possible. Currently, the creation of these categoricals is my bottleneck.
I am currently attempting to create the categoricals with fastpath=True.
Inside the __init__ of Categorical there is a function call codes = coerce_indexer_dtype(values, dtype.categories) (see: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/categorical.py, line 378). I have data that I can format so that I can skip this call (it is one of the primary offenders here).
The super().__init__(codes, dtype) call at the end of the fastpath block seems to prevent me from making an easy subclass of the Categorical class to override the behavior. Perhaps I'm missing something, though. I'm wary of subclassing a pandas class and screwing things up.
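To make "data that I can format" concrete, here is a minimal sketch of pre-formatting the codes myself. The helper smallest_code_dtype is hypothetical (it is not part of pandas), and the thresholds are my assumption about what coerce_indexer_dtype selects: the smallest signed integer dtype that can index the categories.

```python
import numpy as np


def smallest_code_dtype(n_categories):
    """Hypothetical helper, not a pandas function.

    Returns the smallest signed integer dtype able to hold codes for
    n_categories categories. The cut-offs are an assumption about what
    coerce_indexer_dtype would choose.
    """
    for candidate in (np.int8, np.int16, np.int32):
        if n_categories < np.iinfo(candidate).max:
            return np.dtype(candidate)
    return np.dtype(np.int64)


# Raw integer labels that are already valid codes for 1000 categories.
raw = np.array([0, 5, 999, 42])

# Cast once up front so that no further dtype coercion should be needed.
codes = raw.astype(smallest_code_dtype(1000), copy=False)
```

If the codes already arrive in that dtype, the coercion step inside the constructor should have nothing left to do, which is why I would like to bypass it.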
Would be very helpful if anyone had any feedback.
Here is a small code sample showing the basics of what I'm doing:
import numpy as np
import pandas as pd

# Small stand-in for the real data: 1,000,000 rows of (i, j) pairs.
df = pd.DataFrame([(i, j) for i in range(1000) for j in range(1000)])

cats = list(range(1000))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)

# With fastpath=True the int16 values are passed in as the category codes.
df[0] = pd.Categorical(
    values=df[0].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)
df[1] = pd.Categorical(
    values=df[1].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)
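For comparison, a minimal sketch of the same construction through the public Categorical.from_codes constructor, with the codes built directly in NumPy. This is an alternative I am considering, not something I have benchmarked. As far as I know, newer pandas releases deprecate the fastpath keyword in favor of from_codes, but whether from_codes avoids the coerce_indexer_dtype cost depends on the pandas version and would need to be timed.

```python
import numpy as np
import pandas as pd

n = 1000
cats = list(range(n))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)

# Codes already in int16, which is wide enough for 1000 categories.
# np.repeat gives 0,0,...,1,1,... (the "i" column) and np.tile gives
# 0,1,...,999,0,1,... (the "j" column), matching the nested loops above.
codes_0 = np.repeat(np.arange(n, dtype=np.int16), n)
codes_1 = np.tile(np.arange(n, dtype=np.int16), n)

# from_codes interprets the integers as positions into dtype.categories.
df = pd.DataFrame(
    {
        0: pd.Categorical.from_codes(codes_0, dtype=dtype),
        1: pd.Categorical.from_codes(codes_1, dtype=dtype),
    }
)
```

Building the codes with np.repeat and np.tile also avoids the intermediate list of tuples, which matters more than the constructor choice once the row count gets large.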