The title, pretty much.

I just want to know the best and most efficient way to one-hot encode a column with about 2058 unique values (`nunique`). If I call `fit_transform` on that column, I will get an array with 2058 columns (2057 if I drop the first). Is that the right approach? I also have another column with about 441 unique values, so that's another headache I need to take care of.

I know for a fact that the first column (the one with 2058 unique values) is very important for the dataset. It holds the brand names of cars, which in the real world is a deciding factor in whether someone buys the car. So I know it matters, but I'm tempted to exclude it because of the sheer number of unique values and the fact that I'd have to one-hot encode it.

So it boils down to this: is there another way to deal with this many unique values, or something else I can do?

For the sake of this question:

  1. the column with 2058 nuniques = df['A']
  2. the column with 441 nuniques = df['B']
Anonymous Person

1 Answer

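Target encoding may be better suited than one-hot encoding in this case. It replaces each category with a statistic of the target for that category (typically the mean), so the feature stays a single numeric column no matter how many unique values it has.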

https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69
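
As a rough illustration of the idea, here is a minimal, dependency-free sketch of mean target encoding for a continuous target such as a used-car price. The function names, the `smoothing` parameter, and the toy data are all made up for this example; they are not the API of `category_encoders` or any other library.

```python
from collections import defaultdict
from statistics import mean


def fit_target_encoding(categories, targets, smoothing=0.0):
    """Learn a (optionally smoothed) mean of the target for each category.

    `smoothing` is a hypothetical knob for this sketch: larger values pull
    categories with few rows closer to the global mean of the target.
    """
    global_mean = mean(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Replace each category by its learned value; unseen ones get the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Toy data, invented for illustration: car brand and used-car price.
brands = ["audi", "audi", "bmw", "bmw", "bmw", "fiat"]
prices = [20000.0, 22000.0, 30000.0, 28000.0, 32000.0, 8000.0]

# Plain per-brand average (no smoothing).
encoding, global_mean = fit_target_encoding(brands, prices)
encoded = apply_target_encoding(brands + ["tesla"], encoding, global_mean)

# Same data with smoothing: the single "fiat" row is pulled toward the global mean.
smoothed, _ = fit_target_encoding(brands, prices, smoothing=1.0)
```

The mapping should be learned on the training rows only and then applied to validation and test rows; otherwise the encoded column leaks the target into the features. With pandas, the unsmoothed version is roughly a group-by on the category column followed by a mean of the target.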

endive1783
  • Thanks for writing. I went over the link you shared, and then read a bit more about target encoding. What I found was that it is used in situations where the target is binary. In my case, it's continuous: it's the price of a used car. So how would I go about it? When I do a target encoding anyway (using `TargetEncoder` from `category_encoders`), unsurprisingly, I don't get values between 0 and 1 (according to the articles I've read, they should be, because they're all probability values) – Anonymous Person Apr 06 '22 at 08:56
  • It can work very well for continuous values: you just average the target (the price of the car) over all the examples of each brand. That gives you an "average price for brand X" from your dataset. Then, for training your model, you can still normalize the target-encoded column – endive1783 Apr 06 '22 at 09:34
  • Thanks. I think I understand what you're saying. If possible, can you share an article that talks about this? I'd like to take a look at it. – Anonymous Person Apr 07 '22 at 07:53
  • These two articles describe target encoding for continuous variables: https://medium.com/@shailypa/target-encoding-cd3e9c14fcc, https://brendanhasz.github.io/2019/03/04/target-encoding.html#target-encoding Hope it helps you – endive1783 Apr 07 '22 at 08:57
  • This kinda answers my question, so I've marked it as the answer. Thanks for your help! – Anonymous Person Apr 07 '22 at 15:18
  • Hi, I have a follow-up question on this: after I've target encoded the variable, does plotting its graph signify anything? Does it mean anything at all? Or does calculating its correlation with another variable signify anything? – Anonymous Person Apr 15 '22 at 12:52