
I am following a course on deep learning and I have a model built with Keras. After data preprocessing and encoding of the categorical data, I get an array of shape (12500,) as the input to the model. This input makes the training process slow and laggy. Is there an approach to reduce the dimensionality of the inputs?

The inputs are categorized geo coordinates, weather info, time, and distance, and I am trying to predict the travel time between two geo coordinates.

The original dataset has 8 features, 5 of which are categorical. I used one-hot encoding to encode the categorical data: the geo coordinates have 6000 categories, weather has 15, and time has 96. Altogether, after one-hot encoding, I got an array of shape (12500,) as the input to the model.
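For reference, a toy version of the encoding with made-up ids. It assumes the two geo coordinates are separate origin/destination columns (the exact column layout is an assumption; the remaining encoded features account for the difference up to 12500):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: origin id, destination id, weather id, time-slot id.
X = np.array([[12, 4051, 3, 17],
              [907, 33, 9, 80]])

enc = OneHotEncoder(categories=[np.arange(6000), np.arange(6000),
                                np.arange(15), np.arange(96)])
X_enc = enc.fit_transform(X).toarray()
print(X_enc.shape)  # (2, 12111): 6000 + 6000 + 15 + 96 mostly-zero columns
```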

  • What does the input represent? Do you need everything in the input? What are you trying to output? You need to provide more information to get any sort of meaningful answer. – Primusa Apr 15 '18 at 02:38
  • More information on the inputs has been added. – Klaus Apr 15 '18 at 03:02
  • How does that require an array of 12500? I count five features! – Primusa Apr 15 '18 at 03:05
  • The original dataset has 8 features and 5 of them are categorical. I used one-hot encoding to encode the categorical data: geo coordinates have 6000 categories, weather has 15, and time has 96. Altogether, after one-hot encoding, I got an array of shape (12500,) as the input to the model. – Klaus Apr 15 '18 at 03:31

2 Answers


When the number of categories is large, one-hot encoding becomes very inefficient. The extreme example is the processing of natural-language sentences, where the vocabulary often has 100k or more words. Translating a 10-word sentence into a [10, 100000] matrix, almost all of which is zeros, would obviously be a waste of memory.

What researchers use instead is an embedding layer, which learns a dense representation of a categorical feature. In the case of words, it's called a word embedding, e.g. word2vec. This representation is much smaller, something like 100-dimensional, and lets the rest of the network work efficiently with 100-d input vectors rather than 100000-d vectors.

In Keras, it's implemented by the Embedding layer, which I think would work perfectly for your geo and time features, while the others will probably be fine with one-hot encoding. This means that your model is no longer Sequential, but rather has several inputs, some of which go through an embedding layer. The main model then takes the concatenation of the learned representations and does the regression.
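Here's a minimal sketch of such a multi-input model with the functional API. The embedding sizes (50 and 8) and the 10-feature "other" input are assumptions for illustration; tune them for your data:

```python
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from keras.models import Model

geo = Input(shape=(2,), name="geo")        # integer ids: origin, destination
time = Input(shape=(1,), name="time")      # integer id: time slot (0..95)
other = Input(shape=(10,), name="other")   # remaining features, passed as-is

# Each Embedding maps category ids to small dense vectors that are
# learned together with the rest of the network.
geo_vec = Flatten()(Embedding(input_dim=6000, output_dim=50)(geo))
time_vec = Flatten()(Embedding(input_dim=96, output_dim=8)(time))

x = Concatenate()([geo_vec, time_vec, other])
x = Dense(64, activation="relu")(x)
travel_time = Dense(1)(x)                  # regression output

model = Model(inputs=[geo, time, other], outputs=travel_time)
model.compile(optimizer="adam", loss="mse")
```

At training time you'd feed three arrays: integer category ids for geo and time (not one-hot vectors), plus the remaining features unchanged.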

Maxim

You can use PCA for dimensionality reduction. It transforms the data into a set of uncorrelated components and keeps those that capture the most variance in the data.

Wikipedia PCA

Analytics Vidhya PCA
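A minimal sketch with scikit-learn; the 95% variance threshold is an arbitrary choice, and the random matrix stands in for your encoded data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (n_samples, 12500) one-hot encoded matrix.
X = np.random.rand(1000, 12500)

# A float n_components keeps as many components as are needed
# to explain 95% of the variance, rather than a fixed number.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```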

Shashi Tunga
  • Can PCA be applied to the dataset after encoding with a one-hot encoder? The original dataset without encoding contains 8 features, and after encoding the input gets to a shape of (15000,). – Klaus Apr 15 '18 at 03:36
  • I think applying PCA before encoding won't make any difference, as after you encode it will result once again in (15000,). So do it after encoding only. – Shashi Tunga Apr 15 '18 at 03:53