import pandas as pd

test = {'ngrp' : ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
test = pd.DataFrame(test)
dummy = pd.get_dummies(test['ngrp'], drop_first = True)

This gives me:

   Brooklyn  Manhattan  Queens  Staten Island
0         0          1       0              0
1         1          0       0              0
2         0          0       1              0
3         0          0       0              1
4         0          0       0              0

Bronx becomes my reference level because it is the one that gets dropped. How do I specify that Manhattan should be the reference level instead? My expected output is:

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1
cs95
John peter

1 Answer

get_dummies sorts your values lexicographically and then creates the dummies. That's why you don't see "Bronx" in your initial result: it was the first value in sorted order, so drop_first=True dropped it.
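
The sorting can be checked directly. The following is a minimal sketch that reuses the question's data; the variable names all_levels and kept_levels are introduced here for illustration only.

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

# Without drop_first, the dummy columns come out in sorted order,
# regardless of the order in which the values appear in the column.
all_levels = pd.get_dummies(test['ngrp']).columns.tolist()

# With drop_first=True, the first level in that sorted order is removed.
kept_levels = pd.get_dummies(test['ngrp'], drop_first=True).columns.tolist()

print(all_levels)
print(kept_levels)
```

Comparing the two lists shows which level was treated as the reference.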

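To avoid the behavior you see, enforce the ordering on a "first-seen" basis (i.e., convert the column to an ordered categorical whose categories follow the order of appearance). Manhattan appears first in your column, so it becomes the level that is dropped.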

pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True), 
    drop_first=True)                                       

   Brooklyn  Queens  Staten Island  Bronx
0         0       0              0      0
1         1       0              0      0
2         0       1              0      0
3         0       0              1      0
4         0       0              0      1

Of course, this has the side effect that the resulting column labels are categorical rather than plain strings, but that's almost never an issue.
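
If the categorical column labels do get in the way (for example when adding new columns later), one possible workaround is to rebuild the labels as an ordinary index. This is a sketch under that assumption, not part of the original answer:

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

dummy = pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=test['ngrp'].unique(), ordered=True),
    drop_first=True)

# Reassigning the labels from a plain list replaces the categorical
# column index with a regular one, keeping the same names and order.
dummy.columns = list(dummy.columns)

print(type(dummy.columns).__name__)
```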

cs95
  • What if I would like to pick a specific category, for example Staten Island? Then it won't be on a 'first-seen' basis anymore. – John peter Nov 15 '19 at 06:10
  • @leecolin Your question doesn't indicate that this could be a possible case. This would still work; you would just need to change the categories argument as appropriate. – cs95 Nov 15 '19 at 06:15
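
As a sketch of what the last comment describes, an arbitrary reference level can be chosen by moving it to the front of the categories list. The names reference and levels are hypothetical, introduced here for illustration:

```python
import pandas as pd

test = pd.DataFrame(
    {'ngrp': ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']}
)

# Hypothetical choice of reference level.
reference = 'Staten Island'

# Put the reference level first; keep the rest in first-seen order.
levels = [reference] + [v for v in test['ngrp'].unique() if v != reference]

# drop_first=True now drops the chosen reference level.
dummy = pd.get_dummies(
    pd.Categorical(test['ngrp'], categories=levels),
    drop_first=True)

print(dummy)
```

The row for the reference level ends up with no dummy set. If the column order does not matter, dropping the column by name after a plain pd.get_dummies call should give an equivalent encoding.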