
I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

My question is:

Should I use Pandas' dtype('category') for the categorical features, or can I keep the default dtype('object')?

Brian Tompsett - 汤莱恩
user4640449
  • 2
    No reason not to use a category here. It will also save a lot of space/memory if the strings are very long (you can check with `info()` or `memory_usage()`, btw). Also, the 't' in dtype is not capitalized. – JohnE Jun 02 '15 at 20:59

2 Answers


Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')

In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 µs per loop
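The `trades` table itself isn't shown here, so the following is only a rough sketch for trying the comparison yourself. The exchange names, row count, and sizes are made up, and the timings will depend on your data, machine, and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `trades` table: all values below are invented.
rng = np.random.default_rng(0)
n_rows = 200_000
trades = pd.DataFrame({
    "exch": rng.choice(["NYSE", "NASDAQ", "ARCA", "BATS"], size=n_rows),
    "size": rng.integers(1, 1_000, size=n_rows),
})


def total_size_per_exchange(frame):
    # observed=True only reports categories that actually occur in the data,
    # so the string and category versions produce the same groups.
    return frame.groupby("exch", observed=True)["size"].sum()


as_strings = trades.copy()
as_category = trades.assign(exch=trades["exch"].astype("category"))

string_seconds = timeit.timeit(lambda: total_size_per_exchange(as_strings), number=20)
category_seconds = timeit.timeit(lambda: total_size_per_exchange(as_category), number=20)

print(f"strings:  {string_seconds:.4f} s for 20 runs")
print(f"category: {category_seconds:.4f} s for 20 runs")
```

Both versions return the same totals; only the time spent grouping should differ.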

Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
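To make the "enumeration" idea concrete, here is a small sketch with made-up exchange labels. It shows that a categorical column stores each distinct label once and keeps an integer code per row.

```python
import pandas as pd

# Hypothetical exchange labels, for illustration only.
exch = pd.Series(["NYSE", "ARCA", "NYSE", "NASDAQ", "ARCA"], dtype="category")

# The distinct labels are stored once (sorted, since no explicit order was given)...
print(list(exch.cat.categories))  # ['ARCA', 'NASDAQ', 'NYSE']

# ...and each row holds only an integer code pointing into that list.
print(exch.cat.codes.tolist())    # [2, 0, 2, 1, 0]

# The "dynamic" part: new labels can be registered later without rebuilding the column.
exch = exch.cat.add_categories(["BATS"])
print(list(exch.cat.categories))  # ['ARCA', 'NASDAQ', 'NYSE', 'BATS']
```

Operations such as groupby can work on the integer codes rather than comparing strings, which is plausibly where the speedup above comes from.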

chrisaycock
  • 1
    Thanks for your answers! So Categorical type is better for memory optimization. – user4640449 Jun 02 '15 at 17:42
  • 7
    The other reason to use Categoricals is that they *can* provide (as it's not the default) an *ordering* to your categories, e.g. ['small', 'medium', 'large']. Then you can sort by this! See the docs [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html#sorting-and-order) – Jeff Jun 02 '15 at 20:13

The Pandas documentation has a concise section on when to use the categorical data type:

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
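The first two cases can be checked with a quick sketch like the one below. The column contents are made up, and the exact byte counts depend on your pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
sizes = pd.Series(["small", "large", "medium"] * 100_000)

# Case 1: few distinct values, so the categorical version is much smaller.
as_category = sizes.astype("category")
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"strings: {string_bytes:,} bytes, category: {category_bytes:,} bytes")

# Case 2: the logical order differs from the lexical order.
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_dtype)

print(sorted(sizes.unique()))                   # ['large', 'medium', 'small']
print(ordered.sort_values().unique().tolist())  # ['small', 'medium', 'large']
print(ordered.min(), ordered.max())             # small large
```

Passing `deep=True` matters for string columns, because otherwise `memory_usage()` may count only the pointers and not the strings themselves.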
willk
  • What is the difference between object and categorical? Which is equivalent to R "factors"? – skan Jul 17 '22 at 23:39