Binarizing pandas dataframe column

Question

mean radius mean texture    mean perimeter  mean area   mean smoothness mean compactness    mean concavity  mean concave points mean symmetry   mean fractal dimension  ... worst texture   worst perimeter worst area  worst smoothness    worst compactness   worst concavity worst concave points    worst symmetry  worst fractal dimension classification
0   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.14710 0.2419  0.07871 ... 17.33   184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890 0
1   20.57   17.77   132.90  1326.0  0.08474 0.07864 0.0869  0.07017 0.1812  0.05667 ... 23.41   158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902 0
2   19.69   21.25   130.00  1203.0  0.10960 0.15990 0.1974  0.12790 0.2069  0.05999 ... 25.53   152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758 0
3   11.42   20.38   77.58   386.1   0.14250 0.28390 0.2414  0.10520 0.2597  0.09744 ... 26.50   98.87   567.7   0.2098  0.8663  0.6869  0.2575  0.6638  0.17300 0
4   20.29   14.34   135.10  1297.0  0.10030 0.13280 0.1980  0.10430 0.1809  0.05883 ... 16.67   152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678 0

Suppose I have a pandas DataFrame that looks like the one above. I want to binarize the mean radius column (change each value to 0 or 1) depending on whether the value is higher than 12.0.

What I've tried is

data_df.loc[data_df["mean radius"] > 12.0] = 0

But this gave me a weird result.

How can I solve this?

pault · Accepted Answer · 2018-01-27T03:39:42.533

8

If you wanted to change the whole column to 1 and 0, you could modify your code slightly to:

# 0 if greater than 12, 1 otherwise
data_df["mean radius"] = (data_df["mean radius"] <= 12.0).astype(int)
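
To make the two steps visible, here is a minimal sketch on made-up data (only the column name is taken from the question; the values are assumptions for illustration). The comparison produces a Boolean Series, and `.astype(int)` turns it into 1s and 0s:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
data_df = pd.DataFrame({"mean radius": [17.99, 11.42, 12.0, 20.29]})

# Step 1: the comparison yields a Boolean Series (True where value <= 12.0).
is_small = data_df["mean radius"] <= 12.0

# Step 2: astype(int) maps True -> 1 and False -> 0.
binarized = is_small.astype(int)

print(binarized.tolist())  # [0, 1, 1, 0]
```

Note that a value of exactly 12.0 counts as "not greater than 12" here, so it becomes 1.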

If you just wanted to set the values where the radius is greater than 12 to 0 (leaving values of 12 or less unchanged):

# only change the values > 12
# this method is discouraged, see edit below
data_df[data_df["mean radius"] > 12.0]["mean radius"] = 0

Edit

As @jp_data_analysis pointed out, chained indexing is discouraged, because the assignment may land on a temporary copy and leave data_df unchanged. The preferred way to do the second operation is multi-axis indexing, reproduced here from the answer below:

# only change the values > 12
data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0

edited Jan 27 '18 at 03:39

answered Jan 27 '18 at 03:08

pault

Thanks for the answer. If I want to store the binarized result to a new column, do I simply have to do `data_df['new column'] = data_df[column_name] < cutoff` ? – Dawn17 Jan 27 '18 at 03:14
Yup exactly. But you'll probably have to call `.astype(int)` if you want 1's and 0's (see my update). `data_df[column_name] < cutoff` will return booleans (`True` and `False`). – pault Jan 27 '18 at 03:15

The Boolean -> int series method is good. But I would discourage chained indexing (see https://stackoverflow.com/a/41253181/9209546). – jpp Jan 27 '18 at 03:28
@jp_data_analysis thanks for the info! I'll edit the post. – pault Jan 27 '18 at 03:30

score 1 · Answer 2 · answered Jan 27 '18 at 03:02

Specify the column as well, like so:

data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
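
A likely explanation for the "weird result" in the question is that `.loc` with only a row selector assigns to every column of the matching rows. The sketch below contrasts the two forms on a small made-up frame (the values and the choice of a second column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-column sample; "mean texture" is included only to show
# what happens to the other columns.
data_df = pd.DataFrame({
    "mean radius": [17.99, 11.42, 20.29],
    "mean texture": [10.38, 20.38, 14.34],
})
mask = data_df["mean radius"] > 12.0

# Row selector only: every column in the matching rows is overwritten.
rows_only = data_df.copy()
rows_only.loc[mask] = 0

# Row selector plus column label: only "mean radius" is overwritten.
rows_and_column = data_df.copy()
rows_and_column.loc[mask, "mean radius"] = 0

print(rows_only)
print(rows_and_column)
```

In `rows_only`, the texture values in the first and third rows are zeroed as well, while `rows_and_column` leaves the texture column untouched.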

answered Jan 27 '18 at 03:02

jpp

If I want to make the opposite case (less than 12.0) to 1, do I just have to write a new line with a different condition? – Dawn17 Jan 27 '18 at 03:04

@Dawn17 Yes. Alternatively, set it to 1 to begin with and just specify 0 condition. But only if these 2 scenarios cover all cases (e.g. no NaN). – jpp Jan 27 '18 at 03:07
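
To illustrate the NaN caveat: a missing value fails both `> 12.0` and `<= 12.0`, so it matches neither condition. The sketch below uses made-up data, and the last step is just one possible way to keep missing values missing, not something proposed in the thread:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
data_df = pd.DataFrame({"mean radius": [17.99, 11.42, np.nan]})
col = data_df["mean radius"]

# NaN compares as False either way, so it is covered by neither condition.
print((col > 12.0).tolist())   # [True, False, False]
print((col <= 12.0).tolist())  # [False, True, False]

# One option (an assumption, not from the thread): binarize as float and
# put NaN back wherever the original value was missing.
flag = (col <= 12.0).astype(float).where(col.notna())
print(flag.tolist())
```

Without that last step, `(col <= 12.0).astype(int)` would silently turn the missing value into 0.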

score 0 · Answer 3 · answered Jan 27 '18 at 03:19

Use `mask`, which replaces the values where the condition is True:

data_df["mean radius"] = data_df["mean radius"].mask(data_df["mean radius"] > 12.0, 0)
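
As a small sketch of how `mask` behaves (made-up values, assumed for illustration): `Series.mask` replaces entries where the condition is True, and `Series.where` is its counterpart that replaces entries where the condition is False, so inverting the condition gives the same result.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
data_df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29]})
col = data_df["mean radius"]

# mask: replace values where the condition is True.
masked = col.mask(col > 12.0, 0)

# where: replace values where the condition is False,
# so the inverted condition produces the same Series.
kept = col.where(col <= 12.0, 0)

print(masked.tolist())  # [0.0, 11.42, 0.0]
```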

answered Jan 27 '18 at 03:19

BENY

score 0 · Answer 4 · answered Jul 30 '22 at 22:12

Another way to do this is to convert the values to Booleans (True and False) and then multiply by 1, which gives 1 for True and 0 for False. Here is how it is done:

data_df['mean_radius'] = (data_df['mean radius'] > 12.0)*1

print(data_df['mean_radius'])

This code will add a new column called mean_radius with binarized values. Let me know if this helps.
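
For comparison, a quick check on made-up data (values assumed for illustration) suggests that multiplying by 1 and calling `.astype(int)` give the same 1/0 values:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
data_df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29]})
is_large = data_df["mean radius"] > 12.0

# Multiplying a Boolean Series by 1 converts True/False to 1/0 ...
by_multiplying = is_large * 1
# ... and an explicit cast does the same while stating the intent.
by_casting = is_large.astype(int)

print(by_multiplying.tolist())  # [1, 0, 1]
print(by_casting.tolist())      # [1, 0, 1]
```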
