
I work in Python. I have a problem with the categorical variable "city".

I'm building a predictive model on a large dataset (over 1 million rows) with more than 100 features. One of them is "city", which takes 33,000 distinct values.

I use e.g. XGBoost, where I need to convert categorical variables into numeric ones. Dummifying makes the number of features explode, and XGBoost (and my 20 GB of RAM) can't handle this.

Is there any way to deal with this variable other than One-Hot Encoding, dummies, etc.? (With One-Hot Encoding I run into performance problems: there are too many features in my model and I run out of memory.)

Is there any way to deal with this?

LtWorf
TigerJ
  • Can you be more specific? Show some code? There is no information about which library you're using. – rpoleski May 23 '20 at 17:34
  • I use xgboost. I'm forecasting apartment prices. Currently I'm skipping the column with the cities, so I don't do any operations on it. I'd like to include it in the feature set, but I don't know how to deal with so many categories. The only solution I've found is e.g. LGBM, which handles categorical variables. – TigerJ May 23 '20 at 17:47

4 Answers


Copying my answer from another question:

XGBoost has since version 1.3.0 added experimental support for categorical features. From the docs:

1.8.7 Categorical Data

Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.

https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf

In the DMatrix section the docs also say:

enable_categorical (boolean, optional) – New in version 1.3.0.

Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.

Other model options:

If you don't need to use XGBoost, you can use a model like LightGBM or CatBoost, which support categorical features out of the box, without one-hot encoding.

Jonatan

You could use some kind of embedding that reflects those cities better (and reduces the total number of features compared to direct OHE): for example, a few features describing the continent each city belongs to, then some others describing the country/region, and so on.

Note that since you didn't provide any specific detail about this task, I've used only geographical data in my example, but you could use other variables related to each city, like the mean temperature, the population, the area, etc., depending on the task you are trying to address here.

Another approach could be replacing the city name with its coordinates (latitude and longitude). Again, this may be helpful depending on the task for your model.
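A minimal sketch of this lookup idea in pandas; the `city_info` table and all its values are hypothetical placeholders (a real lookup could come from a dataset such as GeoNames):

```python
import pandas as pd

# Hypothetical per-city lookup table (illustrative values only).
city_info = {
    "Warsaw": {"lat": 52.23, "lon": 21.01, "population": 1_790_000},
    "Krakow": {"lat": 50.06, "lon": 19.94, "population":   780_000},
    "Gdansk": {"lat": 54.35, "lon": 18.65, "population":   470_000},
}

df = pd.DataFrame({"city": ["Warsaw", "Gdansk", "Krakow", "Warsaw"],
                   "rooms": [2, 3, 1, 4]})

# Replace the 33,000-level name with a handful of numeric columns.
info = pd.DataFrame.from_dict(city_info, orient="index")
df = df.join(info, on="city")   # adds lat / lon / population
df = df.drop(columns="city")    # the raw name is no longer needed
```

The model then sees three dense numeric columns instead of thousands of dummy columns.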

Hope this helps

alan.elkin

Besides switching models, you could also decrease the number of levels by grouping the cities into geographical regions. Another option is grouping them by population size.

You could also group them by their frequency using quantile bins. Target encoding might be another option for you.

Feature engineering in many cases involves a lot of manual work; unfortunately, you cannot always have everything sorted out automatically.
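A sketch of the frequency-binning and (smoothed) target-encoding ideas in pandas, on made-up data; the smoothing constant `m` is an illustrative choice, not a prescribed value:

```python
import pandas as pd

df = pd.DataFrame({"city":  ["A", "A", "A", "B", "B", "C", "D", "D", "D", "D"],
                   "price": [10, 12, 11, 20, 22, 30, 40, 42, 41, 43]})

# Frequency encoding: how common each city is in the data.
freq = df["city"].map(df["city"].value_counts(normalize=True))

# Quantile bins over the frequencies: cities collapse into "rare"/"common" tiers.
df["city_freq_bin"] = pd.qcut(freq, q=2, labels=False, duplicates="drop")

# Smoothed target (mean) encoding: blend each city's mean price with the
# global mean so that rare cities are not over-fitted.
m = 5.0                                   # smoothing strength (arbitrary here)
global_mean = df["price"].mean()
stats = df.groupby("city")["price"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_target_enc"] = df["city"].map(smoothed)
```

In practice, target encoding should be fit on the training folds only (and applied to the validation fold), otherwise the target leaks into the feature.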

hunsnowboarder

There are already great responses here.

Another technique would be to cluster the cities into groups with k-means, using some of the city-specific features in your dataset.

That way you can use the cluster number in place of the actual city name, which can reduce the number of levels quite a bit.
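A sketch of this clustering idea with scikit-learn, on a hypothetical per-city feature table (the coordinates and populations are invented for the demo):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative per-city features; in practice these would come from an
# external dataset (coordinates, population, etc.).
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "lat":  [52.2, 50.1, 54.4, 51.1, 53.4, 50.3],
    "lon":  [21.0, 19.9, 18.6, 17.0, 14.6, 22.6],
    "population": [1.8e6, 0.78e6, 0.47e6, 0.64e6, 0.4e6, 0.2e6],
})

# Standardize so population does not dominate the distance metric.
X = cities[["lat", "lon", "population"]].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cities["city_cluster"] = km.labels_   # use this column instead of the name
```

With 33,000 cities, choosing a cluster count in the tens or hundreds would turn the column into a feature that tree models handle easily.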

SVK