
Should I encode a categorical column with label encoding and then impute its NaN values with the most frequent value, or are there other ways?

As encoding requires converting the dataframe to an array, imputing would then require converting the array back to a dataframe (all this for a single column, and there are more columns like that).

For example, I have the variable BsmtQual, which evaluates the height of a basement and has the following categories:

Ex  Excellent (100+ inches) 
Gd  Good (90-99 inches)
TA  Typical (80-89 inches)
Fa  Fair (70-79 inches)
Po  Poor (<70 inches)
NA  No Basement

Out of 2919 values in BsmtQual, 81 are NaN values.
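
For reference, this is roughly the round-trip I mean (a sketch using scikit-learn's SimpleImputer and LabelEncoder; df stands for the already-loaded dataframe):

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder

    # df is assumed to be loaded already, e.g. df = pd.read_csv("train.csv")

    # SimpleImputer wants 2-D input and hands back a NumPy array...
    imputer = SimpleImputer(strategy="most_frequent")
    bsmt_array = imputer.fit_transform(df[["BsmtQual"]])

    # ...which has to be flattened and written back into the dataframe
    # around the label encoding step. Repeat for every such column.
    df["BsmtQual"] = LabelEncoder().fit_transform(bsmt_array.ravel())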


2 Answers


For future questions like this that don't involve coding, consider posting at https://datascience.stackexchange.com/ instead.

This depends on a few things. First of all, how important is this variable in your exercise? Assuming that you are doing classification, you could try removing all rows with NaN values, running a few models, then removing the variable and running the same models again. If you don't see a dip in accuracy, you might consider removing the variable completely.
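
Here is a rough sketch of that comparison (df and "target" are placeholder names; it assumes the remaining features have no NaNs of their own):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Run 1: keep BsmtQual, but only the rows where it is not NaN.
    kept = df.dropna(subset=["BsmtQual"])
    X_kept = pd.get_dummies(kept.drop(columns=["target"]))
    score_kept = cross_val_score(
        RandomForestClassifier(random_state=0), X_kept, kept["target"], cv=5
    ).mean()

    # Run 2: drop the variable completely and use every row.
    dropped = df.drop(columns=["BsmtQual", "target"])
    X_dropped = pd.get_dummies(dropped)
    score_dropped = cross_val_score(
        RandomForestClassifier(random_state=0), X_dropped, df["target"], cv=5
    ).mean()

    # If the scores are about the same, the variable adds little.
    print(score_kept, score_dropped)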

If you do see a dip in accuracy, or can't judge the impact because the problem is unsupervised, there are several other methods you can try. If you just want a quick fix, and there aren't too many NaNs or categories, you can simply impute with the most frequent value. This shouldn't cause too many problems if the previous conditions are satisfied.
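
For instance, a quick most-frequent fill can stay entirely in pandas (assuming df is your dataframe), which also avoids the array round-trip from the question:

    # Most frequent category (mode) of the column; NaNs are ignored.
    most_frequent = df["BsmtQual"].mode()[0]

    # Fill the gaps without the column ever leaving the dataframe.
    df["BsmtQual"] = df["BsmtQual"].fillna(most_frequent)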

If you want to be more exact, you could consider using the other variables you have to predict the class of the categorical variable (obviously this will only work if the categorical variable is correlated with some of your other variables). You could use a variety of algorithms for this, including classifiers or clustering. It all depends on the distribution of your categorical variable and how much effort you want to put in to solve your issue.
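
As a sketch of that idea (feature_cols is a hypothetical list of already-encoded columns that correlate with BsmtQual):

    from sklearn.ensemble import RandomForestClassifier

    # Train on the rows where the category is known...
    known = df[df["BsmtQual"].notna()]
    missing = df[df["BsmtQual"].isna()]

    clf = RandomForestClassifier(random_state=0)
    clf.fit(known[feature_cols], known["BsmtQual"])

    # ...and predict it for the rows where it is missing.
    df.loc[df["BsmtQual"].isna(), "BsmtQual"] = clf.predict(missing[feature_cols])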

(I'm only learning as well, but I think that's most of your options.)


"… or there are other ways."

Example:

Ex   Excellent (100+ inches)  5 / 5 = 1.0
Gd   Good (90-99 inches)      4 / 5 = 0.8
TA   Typical (80-89 inches)   3 / 5 = 0.6
Fa   Fair (70-79 inches)      2 / 5 = 0.4
Po   Poor (<70 inches)        1 / 5 = 0.2
NA   No Basement              0 / 5 = 0.0
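
As a sketch, that mapping is a plain dictionary in pandas (assuming df is the dataframe; pandas reads the NA label as NaN, so "No Basement" is filled separately):

    # Ordinal mapping from the table above.
    quality_map = {"Ex": 1.0, "Gd": 0.8, "TA": 0.6, "Fa": 0.4, "Po": 0.2}
    df["BsmtQual"] = df["BsmtQual"].map(quality_map).fillna(0.0)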

However, labels express less precision than actual measurements, which affects accuracy if the two are combined.

This could be solved either by scaling values over the category range (e.g. scaling 0-69 inches over 0.0-0.2), or by using an approximation value for each category (more linearly accurate). For example, if the highest value is 200 inches:

Ex   Excellent (100+ inches)  100                    / 200 = 0.5000
Gd   Good (90-99 inches)      (((99 - 90) / 2) + 90) / 200 = 0.4725
TA   Typical (80-89 inches)   (((89 - 80) / 2) + 80) / 200 = 0.4225
Fa   Fair (70-79 inches)      (((79 - 70) / 2) + 70) / 200 = 0.3725
Po   Poor (<70 inches)        (69 / 2)               / 200 = 0.1725
NA   No Basement              0                      / 200 = 0.0000

Actual measurement (120 inches)  120                 / 200 = 0.6000

This produces a decent approximation (the range mid-point value, except for Ex, which is open-ended and therefore uses its minimum). If calculations on such columns still produce inaccuracies, that is due to notation imprecision (the labels express ranges rather than exact values).
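
The same dictionary idea works for the mid-point version, with the values from the table above:

    # Range mid-points scaled by an assumed maximum of 200 inches;
    # Ex is open-ended, so its minimum (100) stands in for a mid-point.
    midpoint_map = {"Ex": 0.5, "Gd": 0.4725, "TA": 0.4225, "Fa": 0.3725, "Po": 0.1725}
    df["BsmtQual"] = df["BsmtQual"].map(midpoint_map).fillna(0.0)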
