Operating on object-typed dataframes/arrays is slow because Pandas needs to process each item through the inefficient CPython interpreter. This causes a high overhead due to reference counting, internal pointer indirections, type checks, internal function calls, etc. Pandas often uses Numpy internally, which can be much faster when the types are native ones like `int64`, `int32` or `float64`. In that case, Numpy can execute optimized native code that is not slowed down by the CPython overheads and that can even benefit from hardware SIMD units (depending on the target function used).
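To make the difference concrete, here is a minimal sketch (with made-up data) comparing the same integer values stored once as a native `int64` column and once as an `object` column:

```python
import numpy as np
import pandas as pd
from timeit import timeit

n = 1_000_000
# Hypothetical example data: identical integers stored as native int64
# versus as boxed CPython objects.
df = pd.DataFrame({
    "native": np.arange(n, dtype=np.int64),
    "boxed": np.arange(n, dtype=np.int64).astype(object),
})

# The native column lets Numpy run optimized (possibly SIMD) code,
# while the object column falls back to per-item CPython operations.
print(timeit(lambda: df["native"].sum(), number=10))
print(timeit(lambda: df["boxed"].sum(), number=10))
```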
While Numpy supports bounded (fixed-size) strings, Pandas does not use them but slow CPython string objects instead. Strings are inherently slow, even in native code, because of their generally variable size, which is often unpredictable (this strongly impacts the processor, which needs to predict branches in order to be fast; see this post about branch prediction). In practice, unicode characters make strings even slower (they make the use of SIMD instructions very difficult and branches even harder to predict).

Categoricals are basically integers associated with a mapping table (of unique values). Categorical columns can theoretically be faster for some computations because the table is already computed. However, the initial computation of the table can be expensive. Additionally, the table is not always used efficiently where it could be, sometimes resulting in a surprisingly slower execution compared to plain integers. Not to mention the table can be big when all the values are different.
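As an illustration of the string and categorical points above, here is a small sketch (again with hypothetical data) that converts an object-typed string column to `category` and compares memory usage:

```python
import numpy as np
import pandas as pd

# Hypothetical string column with few unique values (a good case for 'category').
s = pd.Series(np.random.choice(["red", "green", "blue"], size=1_000_000))

cat = s.astype("category")          # integer codes + a small table of unique values

print(s.memory_usage(deep=True))    # object dtype (by default): one CPython string per item
print(cat.memory_usage(deep=True))  # category: int8 codes + 3 unique strings

# Comparisons/group-bys on 'cat' can reuse the precomputed table,
# but building that table (the astype call above) has its own cost.
```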
Integers are the least expensive type. Smaller integers can often be faster. Indeed, SIMD vectors have a fixed size (e.g. the AVX2 SIMD instruction set of x86-64 processors can compute 32 `int8` values at once compared to only 4 `int64` ones). Furthermore, smaller items cause whole columns to take less memory, reducing the required memory throughput, which improves the performance of memory-bound code (starting with dataframe copies, which are pretty frequent in Pandas). However, this is not always faster because smaller types can sometimes cause type conversions, adding an additional overhead (though this overhead can be mitigated using lower-level optimizations). Thus, if you are working on huge dataframes, please consider using small integer types. Otherwise, `int64` is certainly a very good option.
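Finally, a quick sketch of downcasting (the column and values are hypothetical): `pd.to_numeric` with `downcast="integer"` picks the smallest integer type able to hold the values:

```python
import numpy as np
import pandas as pd

# Hypothetical column whose values all fit in 8 bits.
df = pd.DataFrame({"x": np.random.randint(0, 100, size=1_000_000, dtype=np.int64)})

# Downcast to the smallest integer type that can hold the values (int8 here).
df["x_small"] = pd.to_numeric(df["x"], downcast="integer")

print(df["x"].dtype, df["x"].memory_usage(deep=True))              # int64, ~8 MB
print(df["x_small"].dtype, df["x_small"].memory_usage(deep=True))  # int8, ~1 MB

# The smaller column moves 8x less memory, which mostly matters for
# memory-bound operations (copies, reductions) on huge dataframes.
```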