100

I have fed the following CSV file into IPython Notebook:

public = pd.read_csv("categories.csv")
public

I've also imported pandas as pd, numpy as np, and matplotlib.pyplot as plt. The following data types are present (the below is a summary; there are about 100 columns):

In [36]:   public.dtypes
Out[36]:   parks          object
           playgrounds    object
           sports         object
           roading        object
           resident       int64
           children       int64

I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (they contain Likert-scale responses, though each column has a different set of responses - e.g. one has "strongly agree", "agree", etc., while another has "very important", "important", etc.), leaving the remainder as int64.

I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change a number of columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

Notwithstanding this, I don't want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.

I tried numerous ways to achieve this, then tried the code here: Change column type in pandas.

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert-scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns that are strings, ints and floats) untouched?

I am using Python 2.7.

starball
gincard

8 Answers

152

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
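Since Likert responses have a natural order, a variation of this loop can give each column its own ordered categorical type, which enables comparisons and sorting. A sketch, assuming a newer pandas version (`pd.CategoricalDtype` requires >= 0.21) and using made-up response labels and a hypothetical `likert_scales` mapping:

```python
import pandas as pd

public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'agree'],
    'resident': [1, 2, 1],
})

# Hypothetical mapping: each Likert column gets its own ordered level set
likert_scales = {
    'parks': ['strongly disagree', 'disagree', 'agree', 'strongly agree'],
}

for col, levels in likert_scales.items():
    # ordered=True makes min/max/sort and < comparisons meaningful
    public[col] = public[col].astype(
        pd.CategoricalDtype(categories=levels, ordered=True))

print(public['parks'].min())  # lowest response present: 'agree'
```

With ordered categories, `public['parks'].min()` and `public.sort_values('parks')` respect the scale order rather than alphabetical order.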
unutbu
  • Thank you so much @unutbu, this has worked a treat. I can't believe that it's that simple and I feel really stupid now! – gincard Mar 07 '15 at 03:09
  • What if I'm getting this error? /Users/air/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy – Jan Sila Oct 10 '16 at 18:06
  • @JanSila: You may get that warning if `public` is a sub-DataFrame of another DataFrame, with data which was *copied* from that other DataFrame. Out of an abundance of caution, Pandas emits the warning to remind you that modifying `public` does not modify that other DataFrame. If modifying that other DataFrame is not what you intend to do, or is not an issue, then you are free to ignore the warning. If you wish to [silence the warning](http://stackoverflow.com/a/38810015/190597) anyway, place `public.is_copy = False` before making assignments of the form `public[col] = ...`. – unutbu Oct 10 '16 at 18:54
  • @unutbu thanks, I didn't quite get it from the documentation. This makes it clear. Kind of like deep and shallow copy in C++ right? I casted the original variables and then copied. Or using.copy() when making the subset works as well, if anyone faces this issue as well. – Jan Sila Oct 10 '16 at 19:08
  • @JanSila: Yes, that's right. `public.copy()` also works, but note that if `public` is a large DataFrame, `public.copy()` could be much slower than setting the flag `public.is_copy = False`. (Moreover, the warning is relevant only when `public` is a copy, so it seems ironic that we would need to make yet another copy just to silence it.) On the other hand, I don't think `public.is_copy = False` is documented; I found it by reading the source code. So if sticking to the documented API is a priority, you may wish to use `public = public.copy()`. – unutbu Oct 10 '16 at 19:21
  • @unutbu: note it has been vectorized (`df[sel_cols] = df[sel_cols].astype(new_type)`); and there is a typo in your nick;) – mirekphd Dec 26 '21 at 19:22
70

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to execute this inplace, so typically I'll end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).
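Note that `.apply` returns a new DataFrame rather than modifying `df` in place, so to persist the conversion you assign the result back to the same columns. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'parks': ['agree', 'disagree'],
                   'sports': ['important', 'not important'],
                   'resident': [1, 2]})

cols = ['parks', 'sports']
# .apply returns a new object; assign back so df itself is changed
df[cols] = df[cols].apply(lambda x: x.astype('category'))

print(df.dtypes)  # parks and sports become category, resident stays int64
```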

Derek Kaknes
  • Excellent, +1! Really pythonic and concise solution. Also great as it uses a query to obtain the desired columns instead of passing them as a hard-coded array. – DarkCygnus Sep 26 '17 at 18:11
  • df = df.apply(lambda s: s.astype('category') if s.name in ['parks', 'playgrounds', 'sports'] else s) – LePuppy Sep 17 '19 at 12:08
  • `df[categoricals] = df[categoricals].apply(lambda x: x.astype('category'))` – Milind Dalvi Aug 22 '20 at 23:06
69

No need for loops: Pandas can now do it directly. Just pass a list of the columns you want to convert and Pandas will convert them all.

cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')

df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})

>>     a  b
>>  0  a  c
>>  1  b  d
>>  2  c  e

df.dtypes
>> a    object
>> b    object
>> dtype: object

df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a    category
>> b    category
>> dtype: object
Maximilian Peters
14

As of pandas 0.19.0, the What's New notes describe that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
            'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'resident' : np.random.choice([1, 2, 3], size=N),
            'children' : np.random.choice([0, 1, 2, 3], size=N)
                       })
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>=0.19.0

For mixed dtypes, parsing columns as Categorical can be done by passing a dictionary dtype={'colname': 'category', ...} to read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object

Performance

A slight speed-up (local jupyter notebook), as mentioned in the release notes.

# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop

# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
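The same `dtype` argument can be tried without writing a file at all, by reading from an in-memory buffer. A sketch with toy data (the column names here are illustrative):

```python
import io
import pandas as pd

csv_text = "parks,resident\nstrongly agree,1\ndisagree,2\n"

# dtype={'col': 'category'} asks read_csv to parse that column as Categorical
df = pd.read_csv(io.StringIO(csv_text), dtype={'parks': 'category'})

print(df.dtypes)  # parks: category, resident: int64
```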
Kevin
8

To make things easier: no apply, no map, no loop.

cols=data.select_dtypes(exclude='int').columns.to_list()
data[cols]=data[cols].astype('category')
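One caveat with this approach: `exclude='int'` selects every non-integer column, so float or datetime columns would be converted too. A sketch of a variant that selects only object columns instead (the column names are made up):

```python
import pandas as pd

data = pd.DataFrame({'parks': ['agree', 'disagree'],
                     'resident': [1, 2],
                     'score': [0.5, 0.7]})

# include='object' leaves both int and float columns untouched
cols = data.select_dtypes(include='object').columns.to_list()
data[cols] = data[cols].astype('category')

print(data.dtypes)  # parks: category, resident: int64, score: float64
```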
liangli
1

Using a list comprehension (avoiding an explicit loop), this converts all columns with dtype object to category. I've used 'df' as the dataframe name to be more generic.

df[[col for col in df.columns if df[col].dtypes == object]].astype('category', copy=False)

In case you'd like to avoid the "copy=False" argument (the pandas documentation tells us to be careful while using it), you may use the following line, which assigns the result back:

df[[col for col in df.columns if df[col].dtypes == object]] = df[[col for col in df.columns if df[col].dtypes == object]].astype('category')

This is my first answer on stack, so please be kind.

cfalak
0

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2', etc.]:
    dataframe_name[col] = dataframe_name[col].astype(float)
NikoTumi
-1

Jupyter Notebook

In my case, I had a big DataFrame with many object columns that I wanted to convert to category.

Therefore, I selected the object columns, filled anything that was NA with 'Missing', and then saved the result back into the original DataFrame:

# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()

I hope this is a helpful resource for later reference.

rsc05