How to make dummy coding (pd.get_dummies()) only for categories which share in nominal variables is at least 40% in Python Pandas?

Question

I have DataFrame like below:

COL1 | COL2 | COL3 | ...  | COLn
-----|------|------|------|----
111  | A    | Y    | ...  | ...
222  | A    | Y    | ...  | ...
333  | B    | Z    | ...  | ...
444  | C    | Z    | ...  | ...
555  | D    | P    | ...  | ...

And I need to do dummy coding (pandas.get_dummies()) only for categories which share in variable is at least 40%.

So, in COL2 only category "A" share in variable is at least 40%, in COL3 categories "Y" and "Z" share is at least 40% in variable.

So, as a result I need output like below:

COL1   | COL2_A  | COL3_Y  | COL_Z | ...  | COLn
-------|---------|---------|-------|------|-------
111    | 1       | 1       | 0     | ...  | ...
222    | 1       | 1       | 0     | ...  | ...
333    | 0       | 0       | 1     | ...  | ...
444    | 0       | 0       | 1     | ...  | ...
555    | 0       | 0       | 0     | ...  | ...

reproducible example of source df below:

df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
df

Gilseung Ahn, I provided reproducible example at the end of question — dingaro, Jan 05 '23 at 00:32

score 3 · Accepted Answer · answered Jan 05 '23 at 00:47

Make dummy variables and remove some of them whose occurence ratio is smaller than 0.4 as follows.

import pandas as pd

df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]

origin_cols = df.columns
df = pd.get_dummies(df)
dummy_vars = [col for col in df.columns if (col.split('_')[0] in origin_cols) and ('_' in col)]

cond = df[dummy_vars].sum() / len(df) < 0.4
useless_cols = cond[cond].index
df = df.drop(useless_cols, axis = 1)

How to make dummy coding (pd.get_dummies()) only for categories which share in nominal variables is at least 40% in Python Pandas?

1 Answers1