When to use Category rather than Object?

Question

I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

My question is:

Should I use the dtype('category') of Pandas for the categorical features, or can I keep the default dtype('object')?

No reason not to use a category here. It will also save a lot of space/memory if the strings are very long (you can check with `info()` or `memory_usage()`, btw). Also, the 't' in dtype is not capitalized. – JohnE, Jun 02 '15 at 20:59
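A minimal sketch of the memory check suggested in the comment above. The column contents are hypothetical (three long, made-up labels repeated over 100,000 rows), and the exact byte counts will vary with the pandas version and platform, so none are quoted here.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100,000 rows drawn from three long labels (made up for illustration).
rng = np.random.default_rng(0)
labels = [
    "first_long_category_label",
    "second_long_category_label",
    "third_long_category_label",
]
as_object = pd.Series(rng.choice(labels, size=100_000), dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not only the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

For a whole DataFrame, `df.info(memory_usage="deep")` reports the same kind of total.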

chrisaycock · Answer 1 · 2015-06-02T17:46:29.893

26

Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')

In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 µs per loop

Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
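To make the "dynamic enumeration" idea concrete, here is a rough sketch using a synthetic stand-in for the `trades` table. The exchange names, sizes, and row count are assumptions made up for illustration, not the data used for the timings above. It shows that each distinct string is stored once and the rows hold small integer codes, and that the groupby result is unchanged by the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the trades table; exchange names and sizes are made up.
rng = np.random.default_rng(42)
n_rows = 50_000
trades = pd.DataFrame({
    "exch": pd.Series(rng.choice(["NYSE", "NASDAQ", "ARCA", "BATS"], size=n_rows), dtype=object),
    "size": rng.integers(1, 1_000, size=n_rows),
})

# Aggregate with the default object dtype first.
by_object = trades.groupby("exch")["size"].sum()

trades["exch"] = trades["exch"].astype("category")

# Each distinct string is stored once in `categories`; the rows hold integer codes.
categories = list(trades["exch"].cat.categories)
codes = trades["exch"].cat.codes

# observed=True keeps only categories that actually occur in the data.
by_category = trades.groupby("exch", observed=True)["size"].sum()

print(categories)
print(codes.dtype)
```

Whether the speed-up is worth it depends on how much repetition the column has, so it is worth timing on your own data.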

edited Jun 02 '15 at 17:46

answered Jun 02 '15 at 16:50


Thanks for your answers! So the Categorical type is better for memory optimization. – user4640449 Jun 02 '15 at 17:42

The other reason to use Categoricals is that they *can* provide (as it's not the default) an *ordering* to your categories, e.g. maybe ['small','medium','large']. Then you can sort by this! See the docs [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html#sorting-and-order) – Jeff Jun 02 '15 at 20:13

score 24 · Answer 2 · answered Aug 01 '18 at 13:01

The Pandas documentation has a concise section on when to use the categorical data type:

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.

The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.

As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
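The ordering point in the list above can be sketched as follows, using the "one", "two", "three" values from the quoted text. The sample Series is made up for illustration; `pd.CategoricalDtype` with `ordered=True` is one way to declare the logical order.

```python
import pandas as pd

# Made-up sample values for illustration.
numbers = pd.Series(["three", "one", "two", "one"])

# As plain strings, sorting is lexical: "one" < "three" < "two".
lexical = numbers.sort_values().tolist()

# Declare the logical order explicitly; ordered=True enables sorting and min/max on it.
number_dtype = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = numbers.astype(number_dtype)

logical = ordered.sort_values().tolist()
smallest, largest = ordered.min(), ordered.max()

print(lexical)   # ['one', 'one', 'three', 'two']
print(logical)   # ['one', 'one', 'two', 'three']
print(smallest, largest)  # one three
```

Without `ordered=True`, calling `min()` or `max()` on a categorical raises a `TypeError`, since no order has been declared.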

What is the difference between object and categorical? Which is equivalent to R "factors"? — skan, Jul 17 '22 at 23:39
