5

I have a Pandas DataFrame that contains several string values. I want to replace them with integer values in order to calculate similarities. For example:

stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]: 
    CNPJ_Store_Code      region  total_facings
1    93209765046613   Geo RS/SC       1.471690
16   93209765046290   Geo RS/SC       1.385636
19   93209765044084  Geo PR/SPI       0.217054
21   93209765044831   Geo RS/SC       0.804633
23   93209765045218  Geo PR/SPI       0.708165

I want to replace each region with an integer: 'Geo RS/SC' with 1, 'Geo PR/SPI' with 2, and so on.

Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be. Any ideas? I am trying to use DictVectorizer, with no success.

I'm sure there's an intelligent way to do it, but I just can't find it.

Is anyone familiar with a solution?

user3318421
  • Does using a categorical dtype solve your problems? http://pandas-docs.github.io/pandas-docs-travis/categorical.html – firelynx Aug 06 '15 at 07:44
  • I solved the issue by using LabelEncoder() from sklearn. http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – user3318421 Aug 06 '15 at 08:37
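
For readers who want to see what the LabelEncoder approach mentioned in the comment above might look like, here is a minimal sketch. The toy DataFrame and the column name region_code are assumptions made for illustration; only the region values come from the question. Note that LabelEncoder numbers the labels from 0 in sorted order, so the codes will not match the 1, 2 numbering in the question unless you shift them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data mirroring the 'region' column from the question (assumed for illustration).
stores = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

encoder = LabelEncoder()
# fit_transform learns the distinct labels and returns one integer code per row.
# Codes start at 0 and follow the sorted order of the labels.
stores["region_code"] = encoder.fit_transform(stores["region"])

# encoder.classes_ holds the labels; the label at position i has code i.
labels = list(encoder.classes_)
```

If you need the original strings back later, encoder.inverse_transform converts the codes to labels again.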

3 Answers

7

You can use the .apply() method with a dictionary that maps every known string value to its corresponding integer value:

region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI': 2, ....}
# Raises KeyError for any region that is missing from the dictionary.
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
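
Since the question says the regions are not known in advance, the dictionary itself could probably be built from the data rather than typed by hand. A rough sketch, assuming a toy DataFrame and a hypothetical output column region_code (neither is from the original post):

```python
import pandas as pd

# Toy data mirroring the 'region' column from the question (assumed for illustration).
stores = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

# unique() returns the distinct values in order of first appearance,
# so the first region seen gets 1, the next new one gets 2, and so on.
region_dictionary = {
    name: code for code, name in enumerate(stores["region"].unique(), start=1)
}

# map() looks each value up in the dictionary.
stores["region_code"] = stores["region"].map(region_dictionary)
```

Here map() is used in place of apply() with a lambda; it does the same lookup but yields NaN instead of raising KeyError for a value missing from the dictionary.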
DeepSpace
4

It looks to me like what you really want are pandas categoricals:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

I think you just need to change the dtype of your text column to "category" and you are done.

stores['region'] = stores["region"].astype('category')
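
Note that the column still displays the region strings after this conversion. If you need the integers themselves, the .cat.codes accessor should give them to you. A small sketch, with a toy DataFrame and a hypothetical region_code column assumed for illustration:

```python
import pandas as pd

# Toy data mirroring the 'region' column from the question (assumed for illustration).
stores = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

region_cat = stores["region"].astype("category")

# cat.codes gives one integer per row, starting at 0.
# For string data the categories are sorted, so codes follow alphabetical order.
stores["region_code"] = region_cat.cat.codes

# cat.categories lists the labels; the label at position i has code i.
categories = list(region_cat.cat.categories)
```

Missing values get the code -1, so check for those before feeding the codes into a similarity calculation.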
agomcas
1

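You can do:

import pandas as pd

df = pd.read_csv(filename, index_col=0)  # Assuming the data is in a CSV file.

def region_to_numeric(a):
    # Map each known region name to its integer code.
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2
    # Any other region falls through and returns None.

df['region_num'] = df['region'].apply(region_to_numeric)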
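
The function above only handles regions that are hard-coded into it. If the regions are not known in advance, pd.factorize may be a simpler route, since it assigns the codes for you. This is a different technique from the one in this answer, shown only as a sketch; the toy DataFrame stands in for the CSV file and is an assumption.

```python
import pandas as pd

# Toy data standing in for the CSV contents (assumed for illustration).
df = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

# factorize returns integer codes (starting at 0, in order of first appearance)
# together with the distinct values those codes refer to.
codes, uniques = pd.factorize(df["region"])

# Shift by one so that numbering starts at 1, as in the question.
df["region_num"] = codes + 1
```

With this data 'Geo RS/SC' appears first and becomes 1, and 'Geo PR/SPI' becomes 2. The numbering depends on row order, so a differently ordered file would produce different codes.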
Uttara