
Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:

What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?

Which one comes first and why?

ROMANIA_engineer
Newbie

1 Answer


Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.

For example, you might have a feature which is the day of the week. Then you create a one-hot-encoding for each possible value:

1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
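
As a minimal sketch of that encoding, assuming scikit-learn is available (the column order follows the encoder's sorted categories, so it may differ from the Sunday-first layout above):

from sklearn.preprocessing import OneHotEncoder

# each row is one sample with a single categorical feature: the day of the week
days = [["Sunday"], ["Monday"], ["Tuesday"], ["Saturday"]]

encoder = OneHotEncoder()              # one binary column per distinct day
one_hot = encoder.fit_transform(days)  # sparse 0/1 matrix, one row per sample

print(encoder.categories_)             # learned category -> column mapping
print(one_hot.toarray())               # dense view of the binary indicators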

Feature-hashing is mostly used to allow significant storage compression for parameter vectors: you hash the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can therefore live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimensionality reduction, so you usually expect to trade a small decrease in performance for a significant storage benefit.

The example on Wikipedia is a good one. Suppose you have three documents:

  • John likes to watch movies.
  • Mary likes movies too.
  • John also likes football.

Using a bag-of-words model, you first create the document-to-words matrix below (each row is a document, each entry in the matrix indicates whether a word appears in the document).

         John  likes  to  watch  movies  Mary  too  also  football
doc1:      1      1    1      1       1     0    0     0         0
doc2:      0      1    0      0       1     1    1     0         0
doc3:      1      1    0      0       0     0    0     1         1

The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
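
A minimal hand-rolled sketch of that idea (using Python's built-in hash() as a stand-in for h; note that Python randomizes string hashes between runs, and real implementations use a stable hash such as MurmurHash):

def hash_features(words, n_buckets):
    vec = [0] * n_buckets
    for w in words:
        idx = hash(w) % n_buckets  # hash value used directly as the feature index
        vec[idx] += 1              # update the resulting vector at that index
    return vec

print(hash_features("John likes to watch movies".lower().split(), 3))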

Suppose you generate the hashed features below with 3 buckets (you apply k different hash functions to the original features and count how many times each hashed value hits a bucket).

       bucket1 bucket2  bucket3
doc1:    3         2        0
doc2:    2         2        0
doc3:    1         0        2

Now you have successfully transformed the 9-dimensional features into 3 dimensions.
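
The same transformation can be sketched with scikit-learn's HashingVectorizer, configured with 3 buckets like the table above (the exact per-bucket counts depend on the hash function used, so they will not necessarily match these numbers):

from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football.",
]

# norm=None and alternate_sign=False keep raw, non-negative bucket counts
vectorizer = HashingVectorizer(n_features=3, norm=None, alternate_sign=False)
X = vectorizer.fit_transform(docs)  # shape (3, 3): 3 documents, 3 buckets
print(X.toarray())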

A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.

Imagine you want to design a spam filter that is customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible for both training (training and updating each personalized model) and serving (holding all the classifiers in memory). A smarter way is illustrated below:

[Figure from the feature-hashing paper: each token is duplicated, one copy is prefixed with the user id, and all tokens are hashed into a shared low-dimensional feature space.]

  • Each token is duplicated, and one copy is individualized by concatenating the word with a unique user id (see USER123_NEU and USER123_Votre).
  • The bag-of-words model now holds the common keywords as well as the user-specific keywords.
  • All words are then hashed into a low-dimensional feature space where the model is trained and documents are classified (a rough sketch follows this list).
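
A rough sketch of that duplication step (my own illustration, not the paper's code; the tokens, user ids, and bucket count are made up):

from sklearn.feature_extraction import FeatureHasher

def personalized_tokens(tokens, user_id):
    # keep one global copy of each token plus one user-specific copy
    return tokens + [f"{user_id}_{t}" for t in tokens]

docs = [
    personalized_tokens(["cheap", "meds", "now"], "USER123"),
    personalized_tokens(["meeting", "notes", "attached"], "USER456"),
]

# input_type='string' treats each token as a feature name with an implicit value of 1;
# global and user-specific copies all land in the same shared hashed space
hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform(docs)
print(X.shape)  # (2, 262144): one row per document, fixed-width feature space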

Now to answer your questions:

Yes, one-hot-encoding should come first, since it transforms a categorical feature into binary features so that linear models can consume it. You can certainly apply both on the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit-recognition problem such as MNIST, the image is represented by 28x28 binary pixels, so the input dimension is only 784; feature hashing won't have any benefit in this case.
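
As a hedged sketch of applying both on the same dataset (the column names and bucket count here are illustrative, not from the question): the categorical columns are first expanded one-hot style into "column=value" indicators, which are then hashed into a fixed lower-dimensional space.

from sklearn.feature_extraction import FeatureHasher

rows = [
    {"day": "Sunday", "browser": "firefox"},
    {"day": "Monday", "browser": "chrome"},
]

# For string values, FeatureHasher hashes the indicator "key=value" with weight 1,
# i.e. the one-hot expansion and the hashing are combined into a fixed-width output.
hasher = FeatureHasher(n_features=32, input_type="dict")
X = hasher.transform(rows)
print(X.shape)  # (2, 32) no matter how many distinct categories show up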

greeness
  • Thanks for the answer. I read the paper on spam classification and it still appears that each user has their own model, although with the compressed feature space. Is that what you meant to indicate as well? – sandyp May 25 '22 at 18:20