
I am creating pandas DataFrames with billions of rows. These DataFrames are instantiated from a NumPy array, and I need to convert some columns to a categorical dtype as fast as possible. Currently, creating these categoricals is my bottleneck.

I am currently attempting to create the categoricals with fastpath=True.

Inside the `__init__` of `Categorical` there is a function call `codes = coerce_indexer_dtype(values, dtype.categories)` (see: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/categorical.py, line 378).

I can format my data in advance so that this call is unnecessary, and I would like to skip it (it is one of the primary offenders here).

The `super().__init__(codes, dtype)` call at the end of the fastpath block seems to prevent me from making an easy subclass of the `Categorical` class to override the behavior. Perhaps I'm missing something, though. I'm wary of subclassing a pandas class and breaking something.

Any feedback would be very helpful.

Here is a small code sample with the basics of what I'm doing:

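import pandas as pd
import numpy as np

# Build a 1,000,000-row frame with two integer columns named 0 and 1.
df = pd.DataFrame([(i, j) for i in range(1000) for j in range(1000)])
# Both columns share the same 1000 ordered categories.
cats = list(range(1000))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)
# With fastpath=True the int16 values are treated as category codes.
df[0] = pd.Categorical(
    values=df[0].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)
df[1] = pd.Categorical(
    values=df[1].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)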
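
For comparison, here is a minimal sketch of building the same two columns through the public `pd.Categorical.from_codes` constructor instead of `fastpath=True`. It assumes the integer codes can be produced directly as NumPy arrays (the `codes_0` and `codes_1` names are illustrative, not part of the original code), and it does not claim to avoid the internal `coerce_indexer_dtype` call. Newer pandas releases (2.1 and later, as far as I know) also accept `validate=False` in `from_codes`, which is left out here so the sketch runs on older versions too.

```python
import numpy as np
import pandas as pd

n = 1000
cats = list(range(n))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)

# Codes are positions into `cats`; in this toy data, value i has code i.
# int16 is wide enough for 1000 categories.
codes_0 = np.repeat(np.arange(n, dtype=np.int16), n)  # 0,0,...,1,1,...
codes_1 = np.tile(np.arange(n, dtype=np.int16), n)    # 0,1,...,999,0,1,...

# Build the categorical columns from codes, without first creating
# a frame of Python tuples.
df = pd.DataFrame(
    {
        0: pd.Categorical.from_codes(codes_0, dtype=dtype),
        1: pd.Categorical.from_codes(codes_1, dtype=dtype),
    }
)
```

Whether this is faster than the `fastpath=True` route would need to be measured on the real data.
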
boxblox
  • _I am concerned with creating pandas dataframes with billions of rows... Currently, the creation of these categoricals is my bottleneck_ - I take your word for that. However, for others to be able to help you, you should post the complete code of your benchmark, the benchmark output and your interpretation of the benchmark results, which leads you to conclude that _the creation of these categoricals is my bottleneck_. Otherwise, that's just hand-waving and hearsay, rather than reproducible observations. – Maxim Egorushkin Jan 26 '23 at 02:25
  • In science, one doesn't state conclusions while omitting the measurement methods and observations they are based upon. On the contrary, one describes the measurement methods, the observations obtained, and a plausible interpretation, hoping that anyone repeating the methods and obtaining similar observations (or not) arrives (or not) at a similar interpretation. – Maxim Egorushkin Jan 26 '23 at 02:39
  • It takes 705.59 msec to run your code on my workstation. That neither supports nor denies your claim that _the creation of these categoricals is my bottleneck_. You also refer to something not present in the code you posted. Please follow https://stackoverflow.com/help/minimal-reproducible-example, or have your question closed due to lacking reproduction. – Maxim Egorushkin Jan 26 '23 at 03:05
  • What's the problem with `super().__init__(codes, dtype)` when overriding? – dankal444 Jan 26 '23 at 10:08
  • Creating the dataframe (`pd.DataFrame`) takes 0.5 second on my machine. This is normal for pure-Python code creating a big dataframe. The rest of the code takes only 18 ms which is pretty fast. Indeed, you read 16 MiB of data (typically from RAM), then write 4 MiB due to the conversion and this one requires reads on x86-64 machines (and page-faults) so 8 MiB of RAM are read/written. This means 24 MiB in 8 ms without considering the categorical conversion which requires the opposite conversion. This means 48 MiB in 8 ms. Thus 6 GiB/s for a sequential conversion. This is reasonably fast. – Jérôme Richard Jan 26 '23 at 15:37
  • Note `df[1].to_numpy(...)` is slow when the column is already a categorical column so repeating the last lines is much slower than executing them once. This is certainly because Pandas does a copy in this case and converts the categorical values to 64-bit integers (possibly using a slow table). – Jérôme Richard Jan 26 '23 at 15:43
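
The split discussed in the comments above (frame creation versus categorical conversion) could be measured with a rough sketch like the following. It assumes `time.perf_counter` is precise enough for a single run and uses `pd.Categorical.from_codes` rather than `fastpath=True`; the `timings` dictionary and its keys are illustrative names. Absolute numbers will differ between machines and pandas versions.

```python
import time

import numpy as np
import pandas as pd

n = 1000
dtype = pd.CategoricalDtype(categories=list(range(n)), ordered=True)

# Stage 1: build the frame from Python tuples (pure-Python work).
t0 = time.perf_counter()
df = pd.DataFrame([(i, j) for i in range(n) for j in range(n)])
t1 = time.perf_counter()

# Stage 2: extract each integer column as an int16 NumPy array.
codes = {col: df[col].to_numpy(dtype=np.int16) for col in (0, 1)}
t2 = time.perf_counter()

# Stage 3: wrap the int16 arrays as categoricals.
converted = {
    col: pd.Categorical.from_codes(arr, dtype=dtype)
    for col, arr in codes.items()
}
t3 = time.perf_counter()

timings = {
    "create_frame": t1 - t0,
    "to_int16": t2 - t1,
    "to_categorical": t3 - t2,
}
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.2f} ms")
```

Repeating each stage several times and taking the minimum (for example with the `timeit` module) would give more stable numbers than this single pass.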
