I'm looking for a way to replicate the encode behaviour in Stata, which will convert a categorical string column into a number column.

import pandas as pd

x = pd.DataFrame({'cat':['A','A','B'], 'val':[10,20,30]})
x = x.set_index('cat')

Which results in:

     val
cat     
A     10
A     20
B     30

I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. It would result in:

     val
cat     
1     10
1     20
2     30

Or, just as good:

  cat  val
0   1   10
1   1   20
2   2   30

Any suggestions?

Many thanks as always, Rob

LondonRob
  • maybe: DataFrame([(i[1], i[0]) for i in enumerate(set(x.index))]) and then merge? – lowtech Dec 16 '13 at 20:17
  • Important detail: this is **not** what Stata's `encode` does. It produces one-to-one mappings. – Nick Cox Dec 17 '13 at 00:46
  • @NickCox I don't understand how this isn't a one-to-one mapping. Each instance of `'A'` becomes `1`, each instance of `'B'` becomes `2` etc. – LondonRob Dec 17 '13 at 14:55
  • That's not what I see in your example. I see A, A, B mapping to 10, 20, 30. Why does the first A get 10 and the second get 20? If that's what you want, I don't understand but that's up to you; my point remains that it's not what `encode` does in Stata. – Nick Cox Dec 17 '13 at 15:08
  • @NickCox it's the `cat` column that's getting the mapping, not the `val` column. The `val` column remains unchanged and is of no relevance to the example. The important thing is that `cat` goes from `['A','A','B']` to `[1,1,2]` as per my example. – LondonRob Dec 17 '13 at 15:30
  • Glad to hear it, but I don't see that being clear anywhere in your post. – Nick Cox Dec 17 '13 at 15:35
  • Made the description of what I'm trying to do more explicit, in response to @NickCox's comments. – LondonRob Dec 17 '13 at 16:06

3 Answers

You could use pd.factorize:

import pandas as pd

x = pd.DataFrame({'cat':('A','A','B'), 'val':(10,20,30)})
labels, levels = pd.factorize(x['cat'])
x['cat'] = labels
x = x.set_index('cat')
print(x)

yields

     val
cat     
0     10
0     20
1     30

You could add 1 to labels if you wish to replicate Stata's behaviour:

x['cat'] = labels+1
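
Note that pd.factorize numbers the categories in order of first appearance by default, whereas Stata's encode by default numbers distinct strings in alphabetical order starting from 1. A minimal sketch of matching that ordering with the sort=True argument, assuming a sample frame whose categories appear out of alphabetical order (the names df and cat_code are illustrative only):

```python
import pandas as pd

# Hypothetical sample data in which 'B' appears before 'A'.
df = pd.DataFrame({'cat': ['B', 'A', 'B'], 'val': [10, 20, 30]})

# Default: codes follow the order in which values are first seen.
codes_first_seen, uniques_first_seen = pd.factorize(df['cat'])

# sort=True: the unique values are sorted, so codes follow alphabetical order.
codes_sorted, uniques_sorted = pd.factorize(df['cat'], sort=True)

# Shift from 0-based to 1-based codes.
df['cat_code'] = codes_sorted + 1
print(df)
```

Without sort=True, 'B' would receive the lowest code here simply because it is seen first.
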
unutbu
  • Another way to get at [0,0,1] is to look in `pd.Categorical(seq).labels`. – DSM Dec 16 '13 at 20:14
  • Thanks, @DSM. Looking at [the source code](https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L79), I see `Categorical` calls `factorize`. – unutbu Dec 16 '13 at 20:18
  • Thanks @unutbu. FYI: this is a brilliant way to make beautiful categorised scatter plots, using a text column as the category. – LondonRob Dec 16 '13 at 20:26
  • @unutbu this should go in the docs, can you do a PR for somewhere around here: http://pandas.pydata.org/pandas-docs/dev/reshaping.html#computing-indicator-dummy-variables – Jeff Dec 16 '13 at 20:42
  • use the main repo; stable docs will be updated when 0.13 is released – Jeff Dec 17 '13 at 14:58
  • @Jeff: I've grepped my clone of the repo but could not find strings such as `dummy variables` or `bbacab` which are used on http://pandas.pydata.org/pandas-docs/dev/reshaping.html#computing-indicator-dummy-variables. What file in the repo should be edited to affect the docs? – unutbu Dec 17 '13 at 15:04
  • pandas/doc/source/reshape.rst – Jeff Dec 17 '13 at 15:14

Stata's encode command starts with a string variable and creates a new integer variable with labels mapped to the original string variable. The direct analog of this in pandas is now the categorical variable type, which became a full-fledged part of pandas in version 0.15 (released after this question was originally asked and answered).

See documentation here.

To demonstrate for this example, the Stata command would be something like:

encode cat, generate(cat2)

whereas the pandas command would be:

x['cat2'] = x['cat'].astype('category')

  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B

Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.

You can verify this by using the categorical accessor cat to see the underlying integer codes. (For that reason, you probably don't want to use 'cat' as a column name.)

x['cat2'].cat.codes

0    0
1    0
2    1
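
To get plain 1-based integers like the ones in the question, while keeping a lookup from code back to string (roughly the role of the labels that encode attaches), something along these lines may work. This is a sketch only; the column names label, label_cat and label_code and the code_to_label dictionary are made up for illustration:

```python
import pandas as pd

# Hypothetical frame; the string column is called 'label' rather than 'cat'
# so that it does not clash with the .cat accessor.
df = pd.DataFrame({'label': ['A', 'A', 'B'], 'val': [10, 20, 30]})

# Categorical column: integer codes internally, strings when displayed.
df['label_cat'] = df['label'].astype('category')

# The codes are 0-based; add 1 for 1-based integers.
df['label_code'] = df['label_cat'].cat.codes + 1

# Lookup from 1-based code back to the original string.
code_to_label = dict(enumerate(df['label_cat'].cat.categories, start=1))
print(df)
print(code_to_label)
```
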
Nick Cox
JohnE
  • I've been trying to do this for hours! Was searching convert object to integer, or convert categorical to numeric and going crazy. I'm on pandas 16.2 (current version with anaconda). – James Owers Dec 15 '15 at 04:46
  • +1000 `df['a'].cat.codes` is a lifesaver! Have been scouring the web to find as an alternative to using sklearn's DictVectorizer or LabelEncoder. This combined with OneHotEncoder works beautifully with sklearn-pandas – cmcapellan Dec 16 '15 at 04:58

Assuming you have the fixed set of single capitalized English letters as your categorical variable, you can also do this:

x['cat'] = x['cat'].map(lambda c: ord(c) - 64)

I believe it is a bit of a hack. But then again, in Python, the best thing would be to define a mapping from characters to integers that you desire, such as

my_map = {"A":1, ...} 
# e.g.: {x:ord(x)-64  for x in string.ascii_uppercase}
# if that's the convention you happen to desire.

and then do

x['cat'] = x['cat'].map(lambda c: my_map[c])

or something similar.

This is superior to relying on the conventions of built-in functions for your integer mapping, for numerous reasons. (IMO) conversions like this "feel like" a nuisance to the programmer-analyst, but in reality they represent important metadata about the software you are writing, and they expose the real weakness of global convenience functions in higher level languages like MATLAB, STATA, etc. Even if a built-in function happens to adhere to the particular convention you want (the arbitrary convention that "A" is mapped to 1, "B" is mapped to 2, etc.), that doesn't make it a good idea to use it.
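
A minimal sketch of the explicit-mapping approach, assuming only the categories 'A' and 'B' occur. The category_codes dictionary and the cat_code column are hypothetical names, and the check for unmapped values is an addition rather than part of the code above; it is there because Series.map with a dictionary yields NaN for keys it does not find instead of raising an error:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})

# Hypothetical explicit convention, written out in full.
category_codes = {'A': 1, 'B': 2}

# Fail loudly if the data contain a category missing from the mapping.
unmapped = set(df['cat']) - set(category_codes)
if unmapped:
    raise ValueError(f"No code defined for categories: {sorted(unmapped)}")

# Apply the mapping to produce the integer column.
df['cat_code'] = df['cat'].map(category_codes)
print(df)
```
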

ely
  • I leave comments on MATLAB to experienced users. The comments on Stata's `encode` command are puzzling. It defaults to mapping distinct string values in alphabetical order to integers 1 up, so "A", "B", "C" would be mapped to 1, 2, 3. But that default can be overridden through some specified string to integer translation scheme. If you don't want that, don't use it; there's no discernible issue of language design or philosophy implied. – Nick Cox Dec 17 '13 at 00:51
  • `int64('A') == 65` in MATLAB. `int('A')` raises a `ValueError` in Python, which makes more sense IMHO. Of course, if you only write code in MATLAB that doesn't ever talk to the outside world, then it's a moot point. – Phillip Cloud Dec 17 '13 at 02:25
  • @Phillip Cloud I suppose it's a matter of taste as to whether someone expects `int` to behave that way. Since `int(x)` in Python is just syntactical sugar for `x.__int__()`, I don't see it the same way you do. I don't expect single-length `str` variables to have a different `__int__` than multi-character `str` variables, which provides the distinction for wanting a function like `ord`, but it's just my opinion. – ely Dec 17 '13 at 14:30
  • @EMS Your experience with Stata doesn't extend to being able to spell its name correctly or to know the difference between a Stata command and a Stata function. If length of experience is an argument, feel the weight of my 22 years with Stata. More seriously, and more importantly, your comments about `encode` remain puzzling, as you have changed your argument (really an assertion) to arguing that a language feature is indicted if used in ways you can consider dubious. That's more a reflection of your personal taste than anything else. – Nick Cox Dec 17 '13 at 14:40
  • I can only echo that as you have descended into criticising me, not my argument. – Nick Cox Dec 17 '13 at 15:09
  • @EMS I think you're mistaken. I agree with you w.r.t. the behavior of `__int__`. I supplied an example for @NickCox. Guess I should have mentioned that. – Phillip Cloud Dec 17 '13 at 16:24
  • Oh I see, my bad. I misread your comment as saying that the MATLAB behavior was more desirable. – ely Dec 17 '13 at 16:26