My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories, and my Colab session (allegedly running on a TPU) doesn't crash.

But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.

I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?

EDIT: My target variable is a simple binary one.
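
For reference, the out-of-fold (K-fold) flavor of target encoding I've seen suggested looks roughly like the sketch below; the column names `big_cat` and `target` are placeholders for my data, and the smoothing constant is just a guess:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_target_encode(df, col, target, n_splits=5, smoothing=20):
    """Encode `col` with target means computed only on out-of-fold rows."""
    global_mean = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in folds.split(df, df[target]):
        train = df.iloc[train_idx]
        stats = train.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean to tame the imbalance.
        smoothed = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                    / (stats["count"] + smoothing))
        encoded.iloc[valid_idx] = df.iloc[valid_idx][col].map(smoothed).to_numpy()
    # A category unseen in the training folds falls back to the global mean.
    return encoded.fillna(global_mean)

df["big_cat_te"] = oof_target_encode(df, "big_cat", "target")
```

The idea is that each row only ever gets an encoding computed from rows in the other folds, so its own label never leaks into its feature.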

Ryan
  • Address is very specific, and if it's super detailed, e.g. "XXX St. Apt. 7B", you're essentially including an individual effect, so your out-of-sample prediction is going to be garbage: you'll fail terribly for any address you haven't already seen (very likely if it's that specific). Also, good luck estimating 53K effects. Aggregate the location up to a coarser level (neighborhood block, census tract, ZIP, county, state), depending on the level of heterogeneity you expect. – ALollz Sep 07 '21 at 18:07
  • @ALollz The addresses are semi-anonymized blocks, e.g. "0000X E 100TH PL". Right now, I'm locating the ones that have seen more than the median number of reported crimes (35) and just using those, which brings me to 29,900 and change. I'm predicting a specific crime, so leaving location information out of my RF or GBM model would be irresponsible, to say the least. It's the Chicago Crime data set, so lots and lots of data. – Ryan Sep 07 '21 at 18:14
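
A rough pandas version of that median-count filter (the "Block" column name is an assumption about the Chicago Crime export):

```python
# Keep only the blocks with more reported crimes than the median block.
block_counts = df["Block"].value_counts()
busy_blocks = block_counts[block_counts > block_counts.median()].index
df_busy = df[df["Block"].isin(busy_blocks)]
```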

1 Answer

Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of data science/machine learning that is an art, not a science. A couple of ideas:

  1. One-hot encode everything, then use a dimensionality reduction algorithm (PCA, truncated SVD, etc.) to shrink the resulting columns.
  2. Only one-hot encode the most common values (say, the top 10 or 100 categories rather than all 53,000) and lump everything else into an "other" category; see the sketch after this list.
  3. If it's possible to construct an embedding for this variable (it isn't always), you can explore that.
  4. Group/bin the values in the column by some underlying feature. E.g., if the feature is something like days_since_X, bin it into ranges of 100; if it's names of animals, group them by type instead (mammal, reptile, etc.).
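
A minimal sketch of options 2 and 1 combined (the column name `big_cat` is a stand-in for your 53,000-value feature, and the cutoffs are arbitrary):

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Option 2: keep only the most frequent categories, lump the rest into "other".
TOP_N = 100  # tune to whatever your memory allows
top_values = df["big_cat"].value_counts().nlargest(TOP_N).index
df["big_cat_capped"] = df["big_cat"].where(df["big_cat"].isin(top_values), "other")

# Sparse dummies keep the memory footprint manageable at 3.8M rows.
dummies = pd.get_dummies(df["big_cat_capped"], prefix="big_cat",
                         sparse=True, dtype=float)

# Option 1 on top of that: compress the dummy columns with truncated SVD.
reduced = TruncatedSVD(n_components=20, random_state=0).fit_transform(
    dummies.sparse.to_coo().tocsr())
```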
TC Arlen
  • I'm going with choice 4 at the moment; see my comment above to the other individual who responded. This is usually the fun and imaginative part of the work, but not when it crashes your session. – Ryan Sep 07 '21 at 18:17
  • @Ryan IMO you should consider option 3 as well. You could, e.g., use word2vec from [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to compress this feature into an n-dimensional vector (which is similar to option 1, but here you let a neural network do the work). – Stefan Falk Sep 08 '21 at 06:29
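
A rough sketch of that word2vec idea (the `Date`/`Block` column names and the per-day grouping are assumptions about the Chicago data; any reasonable notion of co-occurrence would do for forming the "sentences"):

```python
from gensim.models import Word2Vec

# Treat the blocks reported on the same day as one "sentence" so word2vec
# can learn which blocks tend to co-occur. The grouping key is illustrative.
day = df["Date"].str[:10]  # assumes Date is stored as a string
sentences = df.groupby(day)["Block"].apply(list).tolist()

w2v = Word2Vec(sentences, vector_size=16, window=5, min_count=1,
               workers=4, seed=0)

# Replace each block with its learned 16-dimensional vector.
block_vectors = df["Block"].map(lambda b: w2v.wv[b])
```

Those vectors can then be split into 16 numeric columns and fed to the RF/GBM instead of 53,000 dummies.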