I am working with a dataset on cell phone churn rates, with a shape of 3333 rows × 20 columns, and I am trying to dummy code a column of state abbreviations. I need to leave out one of the state dummy columns to serve as the "reference" column for modeling. What I think should happen is that a column is created for each row, with a 1 placed in the row that corresponds to the newly created dummy column. Instead, I am currently getting 0s in every row except the first row, which is populated with all 1s. I need to get the dummy variables to include a marker in the appropriate column for each row. I also think I should collapse the columns down to unique columns only (in this case, one for each state), but I am not sure whether that would defeat the point of dummy coding.
I currently have the following code:
1. Creating dummy variables for 'state' and excluding the first dummy column:
churn_dummies = pd.get_dummies(churn, columns='state', prefix='st').iloc[:,20:]
This returns a dataframe that is 3333x3332.
st_OH st_NJ st_OH st_OK st_AL st_MA st_MO st_LA st_WV st_IN st_RI
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This pattern seems to continue through the entire gigantic dataframe that's created, and from spot checks, the rows don't contain 1s in the columns that correspond to their own states. I've been using the following pandas doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
2. Then concatenating the columns onto the dataframe:
churn = pd.concat([churn, churn_dummies], axis=1)
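For comparison, here is a minimal sketch of the kind of result I am after, run on a small made-up frame rather than the real data. The toy values and the names `toy`, `toy_dummies`, `state_cols`, and `row_sums` are only for illustration. Passing `columns` as a list and using `drop_first=True` to drop the reference column are assumptions on my part about how the call is meant to be used; I have not confirmed that this is the right approach for the full dataset.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "state": ["OH", "NJ", "OH", "OK", "AL"],
    "churn": [0, 1, 0, 0, 1],
})

# `columns` is given as a list of column names. drop_first=True omits the
# first category in sorted order ("AL" here), leaving it as the reference.
toy_dummies = pd.get_dummies(
    toy, columns=["state"], prefix="st", drop_first=True, dtype=int
)

# Spot check: each row should have at most one 1 across the state dummies,
# and rows belonging to the reference state should be all 0s.
state_cols = [c for c in toy_dummies.columns if c.startswith("st_")]
row_sums = toy_dummies[state_cols].sum(axis=1)

print(toy_dummies)
print(row_sums.tolist())
```

In this sketch the result keeps the non-state columns and replaces 'state' with one column per remaining state, so a separate concat step would not be needed.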