I have one-hot encoded a column 'postcode' and I want to see the correlation between it and wealth_segment, which has been label encoded as: mass customer = 0, affluent customer = 1, high net worth customer = 2.

I want to see if there is a correlation between the postcode and the wealth of the customer. The thing is, I have many postcode columns because I one-hot encoded it. The naming convention is postcode_XXXX (XXXX being a 4-digit number).

What can I write to find the correlation between only these two variables? I have over 100 other columns in the dataframe, so I do not want to simply call df.corr() on the whole dataframe.

Daniel Walker
Sana Shah
  • Correlation means co-movement of two variables. I do not think you will see this kind of relationship between zip code and wealth. You could, though, turn your zips into geolocations and see if some locations are wealthier than others. Or just sort your zips by the total wealth of the residents. – Sergey Bushmanov Aug 24 '20 at 11:24

1 Answer

If you just want the correlation of each postcode column with the wealth segment column, you can iterate over the column names that start with postcode_, select that column together with wealth_segment in each iteration, and call corr() on the resulting two-column dataframe.

Example:

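# Keep only the one-hot encoded postcode columns
cols = [c for c in df.columns if c.startswith('postcode_')]

for col in cols:
    # 2x2 correlation matrix: this postcode column vs wealth_segment
    filter_df = df[[col, 'wealth_segment']]
    print(filter_df.corr())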
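
The loop prints one 2x2 matrix per postcode, which gets hard to read with many postcodes. As a possible alternative, pandas' DataFrame.corrwith should give the same Pearson values as a single Series, one entry per postcode column. The sketch below is only an illustration: the toy dataframe, its values, and the names postcode_corr and unrelated_col are made up here and are not from the question.

```python
import pandas as pd

# Toy data standing in for the real dataframe (all values are invented).
df = pd.DataFrame({
    "postcode_2000": [1, 0, 0, 1, 0, 0],
    "postcode_3000": [0, 1, 0, 0, 1, 0],
    "postcode_4000": [0, 0, 1, 0, 0, 1],
    "wealth_segment": [2, 0, 1, 2, 0, 1],
    "unrelated_col": [5, 3, 8, 1, 9, 2],  # stands in for the other columns
})

# Keep only the one-hot encoded postcode columns
postcode_cols = [c for c in df.columns if c.startswith("postcode_")]

# Pearson correlation of each postcode column with wealth_segment,
# returned as a Series indexed by postcode column name
postcode_corr = df[postcode_cols].corrwith(df["wealth_segment"])

# Strongest positive associations first
print(postcode_corr.sort_values(ascending=False))
```

Because wealth_segment is an ordinal label (0, 1, 2), these numbers treat the gaps between segments as equal, so they are best read as a rough ranking of postcodes rather than a precise measure.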
jfaccioni