I have DataFrame like below:
COL1 | COL2 | COL3 | ... | COLn
-----|------|------|------|----
111 | A | Y | ... | ...
222 | A | Y | ... | ...
333 | B | Z | ... | ...
444 | C | Z | ... | ...
555 | D | P | ... | ...
And I need to do dummy coding (pandas.get_dummies()
) only for categories which share in variable is at least 40%.
So, in COL2 only category "A" share in variable is at least 40%, in COL3 categories "Y" and "Z" share is at least 40% in variable.
So, as a result I need output like below:
COL1 | COL2_A | COL3_Y | COL_Z | ... | COLn
-------|---------|---------|-------|------|-------
111 | 1 | 1 | 0 | ... | ...
222 | 1 | 1 | 0 | ... | ...
333 | 0 | 0 | 1 | ... | ...
444 | 0 | 0 | 1 | ... | ...
555 | 0 | 0 | 0 | ... | ...
reproducible example of source df below:
df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
df